From: Jian J. <jia...@gm...> - 2008-03-28 01:10:53
|
Hello, Brad, Thanks for your reply. Yes, you are right. I think the correct full URL is not correctly sent to Wayback. But I still have no idea how to solve it. The problem probably lies in the communication between the Apache2 and Tomcat. Actually, we have AJP connectors set up on our back end servers and we map the port 80 to 8080. Suppose the back end servers A and B, and load balancer C. When I tried the URL http://A/wayback-webapp-1.2.0/wayback/, I got a successful message from access_log "GET /wayback-webapp-1.2.0/wayback/ HTTP/1.1" 200 3610. But when I tried the URL http://C/wayback-webapp-1.2.0/wayback/, I got an error message from the access_log of Apache2 "GET /wayback-webapp-1.2.0/wayback/ HTTP/1.1" 404 1042. Obviously, the two requests are both successfully received by Apache2, so why are the responses different? In the http.conf, I already added "JkMount /wayback* ajp13". Since we already use AJP, I prefer to do the modification based on that. Would you please explain in more detail what I should do to ensure the correct requests are received by Wayback. Thanks very much and best regards. Jian On Thu, Mar 27, 2008 at 1:06 PM, Brad Tofel <br...@ar...> wrote: > Hi Jian, > > The problem is probably that the Wayback needs to know the fully > qualified hostname where it is running, so it can generate links correctly. > > To do the kind of setup you're trying to, I know of two solutions using > the current software: > 1) use AJP to ensure that the requests are received by the Wayback with > the correct hostname, port, and context information. > 2) use the "ProxyHost" (and possibly "ProxyPort") settings on the > "Connector" tag within tomcat's server.xml configuration file. This > allows you to explicitly set the values returned by > HttpServletRequest.getServer*(). > > These two settings do not allow as much flexibility (specifically with > proxying a different path from a front end node to a backend wayback > access point) so probably going forward we will change the software to > allow the AccessPoint URI to be set explicitly within the wayback > configuration. > > Please let me know if you have questions on this, and how it works for you. > > Brad > > > > Jian Jiao wrote: > > Hello, > > > > I have a problem when I try to deploy Wayback. > > I have two servers running Wayback at the back end and a balancer > > controller at the front. > > My Wayback works well separatedly on the two back end servers. > > However, When I try to access Wayback using the URL of the balancer > > controller, > > there is always an error. > > > > It seems that Wayback does not know the access point. > > > > In the statement in index.jsp > > ArrayList<String> accessPoints = (ArrayList<String>) names > > accessPoints is always null. > > > > I already modified the wayback.xml to change the replayURIPrefix to > > the new domain. > > > > I don't know what's the wrong with it. Can you please help me to > > figure out what is the problem? > > > > Any help or suggestions would be appreciated. > > > > > > -- Best wishes, Jian Jiao |
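A minimal sketch of Brad's second suggestion, for readers hitting the same front-end/back-end mismatch. On Tomcat's Connector element the documented attribute names are proxyName and proxyPort (the messages above call them "ProxyHost"/"ProxyPort"); the hostname and port values below are placeholders, not taken from this thread:

```xml
<!-- Tomcat server.xml sketch: an AJP connector that reports the load
     balancer's identity. proxyName/proxyPort control what
     HttpServletRequest.getServerName()/getServerPort() return, so Wayback
     generates links that point at the front end rather than the back end.
     Hostname and ports are placeholders. -->
<Connector port="8009" protocol="AJP/1.3"
           proxyName="balancer.example.org"
           proxyPort="80" />
```

With mod_jk in front, the Apache side is the JkMount line Jian already has; the connector attributes above only change what the servlet API reports to Wayback.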
|
From: Brad T. <br...@ar...> - 2008-03-27 17:05:32
|
Hi Jian, The problem is probably that the Wayback needs to know the fully qualified hostname where it is running, so it can generate links correctly. To do the kind of setup you're trying to, I know of two solutions using the current software: 1) use AJP to ensure that the requests are received by the Wayback with the correct hostname, port, and context information. 2) use the "ProxyHost" (and possibly "ProxyPort") settings on the "Connector" tag within tomcat's server.xml configuration file. This allows you to explicitly set the values returned by HttpServletRequest.getServer*(). These two settings do not allow as much flexibility (specifically with proxying a different path from a front end node to a backend wayback access point) so probably going forward we will change the software to allow the AccessPoint URI to be set explicitly within the wayback configuration. Please let me know if you have questions on this, and how it works for you. Brad Jian Jiao wrote: > Hello, > > I have a problem when I try to deploy Wayback. > I have two servers running Wayback at the back end and a balancer > controller at the front. > My Wayback works well separatedly on the two back end servers. > However, When I try to access Wayback using the URL of the balancer > controller, > there is always an error. > > It seems that Wayback does not know the access point. > > In the statement in index.jsp > ArrayList<String> accessPoints = (ArrayList<String>) names > accessPoints is always null. > > I already modified the wayback.xml to change the replayURIPrefix to > the new domain. > > I don't know what's the wrong with it. Can you please help me to > figure out what is the problem? > > Any help or suggestions would be appreciated. > > |
|
From: Jian J. <jia...@gm...> - 2008-03-24 19:38:51
|
Hello, I have a problem when I try to deploy Wayback. I have two servers running Wayback at the back end and a balancer controller at the front. My Wayback works well separately on the two back end servers. However, when I try to access Wayback using the URL of the balancer controller, there is always an error. It seems that Wayback does not know the access point. In the statement in index.jsp, ArrayList<String> accessPoints = (ArrayList<String>) names, accessPoints is always null. I already modified the wayback.xml to change the replayURIPrefix to the new domain. I don't know what's wrong with it. Can you please help me to figure out what the problem is? Any help or suggestions would be appreciated. -- Best wishes, Jian |
|
From: Lukáš M. <lma...@gm...> - 2008-03-20 08:37:05
|
Hello, we've been using and testing Wayback for several years at WebArchiv.cz and we're aware that IA has put a lot of effort into i18n, especially in the last releases. In particular, we appreciate the support for language properties and the configuration of individual jsp pages; nevertheless, we're still facing issues with utf-8 encoding. I'd like to ask others (from non-ascii countries) about their experiences and how they solved this issue. In general, with each new release, we always have to make the following changes (assuming that we store our language properties in utf-8): 1. Convert all jsp into utf-8 2. Add the meta tag "<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">" to each JSP so that the browser can recognize the right encoding 3. Add the directive <%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%> to each JSP to say that the server should send the response in UTF-8 4. If we also want to send unicode text from a form to the server, we have to implement a filter that sets the encoding on the request (req.setCharacterEncoding(encoding);) -- see the sketch after this message. With these changes we're able to customize each release; however, it might help other non-english speaking countries if this were incorporated into wayback. Or is there another intended way to treat this issue? Thanks in advance for your reply. Best Regards -- Lukas Matejka WebArchiv.cz CZ National Library |
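A minimal sketch of the request-encoding filter described in step 4, against the servlet 2.x API of this era; the class name and init parameter are illustrative (not from the Wayback source), and the filter would be registered in web.xml with a /* filter-mapping:

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Sketch of step 4 above: force a character encoding on each request
// before any parameter is parsed. Class name and init-param name are
// illustrative.
public class SetCharacterEncodingFilter implements Filter {
    private String encoding = "UTF-8";

    public void init(FilterConfig config) {
        String configured = config.getInitParameter("encoding");
        if (configured != null) {
            encoding = configured;
        }
    }

    public void doFilter(ServletRequest req, ServletResponse res,
            FilterChain chain) throws IOException, ServletException {
        // Respect an encoding the client declared explicitly.
        if (req.getCharacterEncoding() == null) {
            req.setCharacterEncoding(encoding);
        }
        chain.doFilter(req, res);
    }

    public void destroy() {
    }
}
```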
|
From: Jian J. <jia...@gm...> - 2008-03-20 01:34:11
|
Hello there, I want to change the working directory of Wayback from /tmp to another directory. However, when I try to do that by modifying the configuration file wayback.xml, every directory seems to work well except one: /tmp/index. When I change /tmp/index to another location, Wayback just does not work. How can I do that? Any help would be appreciated. -- Best wishes, Jian Jiao |
|
From: Ignacio G. <igc...@gm...> - 2008-03-12 13:32:32
|
Hello Brad, I just started "playing" with this new version of Wayback, and there is one thing that seems very strange to me. On every page resource I visit, I always get the header information plastered at the top of the page. (e.g. HTTP/1.1 200 OK Server: Microsoft-IIS/5.0 Date: Tue, 03 Oct 2000 07:31:49 GMT Connection: Keep-Alive Content-Length: 13027 Content-Type: text/html Set-Cookie: GWBSiteCookie=header%5Ftype=Text&mode=false&browser=Default&browser%5Fchecked=true&browser%5Fwidth=0; path=/ Cache-control: private) This information denotes the header information that was retrieved at the time of crawl (as you can see by the date); the thing I do not understand is why I am seeing it when I access a page via Wayback. It appears at the very top, even over the TimeLine section. Any ideas on why this might be, or how to get rid of it? Thanks. On 2/29/08, Brad Tofel <br...@ar...> wrote: > > Hi Thomas, > > Thanks for the kind feedback. > > Couple of suggestions, and also some follow-up questions interspersed: > > Thomas Beekman wrote: > > Hi all, > > > > At the KB we are severely testing Wayback 1.2.0 at the moment. My first > > impression is quite positive; many new functions are added, it is quite > > easy to implement different modules for different access points and > > several indexing threads can live side by side now. > > > > I have a few questions though. First of all, I'm experiencing errors > > which did not occur in older versions; java.lang.OutOfMemoryError: GC > > overhead limit exceeded. Does anyone know how to fix this? > > > > > I haven't seen this before, and some quick google searches indicate it > may be one of: > > A) a JVM problem (which JVM are you using?) > B) too little heap space in the java startup arguments > C) the wayback software doing lots of object creation+destruction. > > Since we have large installations in production at the IA, one using > 700+ Collections and 1400+ AccessPoints. Note that these all use CDX > indexes, which are more resource efficient. I'm hoping that C is not the > problem, but we haven't yet needed to do a heavy optimization pass over > the code, so it could be Wayback itself. Are you using IBM's JVM? Have > you tried increasing the heap? If that doesn't address the problem, can > you please send me a copy of your wayback.xml Spring configuration? > > > Second; when closing down Wayback in Tomcat, the lock file for the > > localbdb is not erased. A restart is therefore not possible. Could this > > be fixed so that if the webapp is closed down, the lock file is erased? > > > > > > On what platform (OS+JVM) are you running Wayback? Is the BDB index > stored over NFS or another networked file system? I haven't experienced > this problem on any of our systems -- the BDBJE just starts up, even > with the lock file still existing. I haven't looked into this, but > guessed that it was using the lock file via flock() type semantics, > instead of using it's existence to indicate a lock. BDBJE may determine > that the DB is on a remote system, where flock() semantics don't work, > in which case it may be falling back to using the existence of the lock > file to indicate usage.. > > In any case, I've just implemented the "clean shutdown" processing in my > development environment, but will probably hold off to do more testing > before including it in a release. 
> > We are preparing a 1.2.1 release which addresses a couple bugs > discovered by folks in the field, but are holding this release for > feedback from one more user having trouble reading some ARC files. > > > Third; with a few websites the timeline GUI is scrambled. I get a full > > yellow screen with on every line a mark. After scrolling down that page, > > the website is presented normally. This is not the case with every > > website. > > > > > Yes, the css implementation in the current timeline is prone to > inheriting some styles from some web pages. Could you please send me a > few example pages on the live web that demonstrate the problem you're > seeing? > > > My fourth and last problem is in the configuration. I would like to do > > some tests using the remote NutchWAX search, but there is not a clear > > manual of how to implement this precisely, which beans to use for > > example. Does anyone have a good example for me? > > > > > > Setting up a collection with this bean: > > <property name="resourceIndex"> > <bean class="org.archive.wayback.resourceindex.NutchResourceIndex" > init-method="init"> > <property name="searchUrlBase" > value="http://webteam-ws.us.archive.org:8080/katrina/opensearch" /> > <property name="maxRecords" value="100" /> > </bean> > </property> > > Should do the trick. Note that if using Archival URL mode, you should be > sure to set the maxRecords property on the RequestParser to the same > value for maxRecords.. This may be a bug -- would be more friendly to > use the min() of both values.. > > <property name="parser"> > <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser" > init-method="init"> > <property name="maxRecords" value="100" /> > <property name="earliestTimestamp" value="1996" /> > </bean> > </property> > > > Hopefully this works for you, and please let me know about the questions > above. > > Brad > > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Microsoft > Defy all challenges. Microsoft(R) Visual Studio 2008. > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > |
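One plausible mechanism behind the plastered headers, offered as a hedged guess rather than a confirmed diagnosis: each ARC record stores the HTTP response headers in front of the body, and replay has to consume that block before streaming content, or the headers render as text exactly as described above. A sketch against the Heritrix 1.x ARCRecord API (verify skipHttpHeader() against the commons jar actually on your classpath; a jar mismatch in this area is reported elsewhere on this list):

```java
import java.io.IOException;
import java.io.OutputStream;
import org.archive.io.arc.ARCRecord;

// Hedged sketch, not Wayback's actual replay code: position past the
// stored HTTP header block before copying the record body to the client.
// If this step is skipped or silently fails, the archived headers
// ("HTTP/1.1 200 OK ...") are streamed as if they were page content.
void replayBody(ARCRecord rec, OutputStream out) throws IOException {
    rec.skipHttpHeader();              // consume the stored header block
    byte[] buf = new byte[4096];
    for (int n = rec.read(buf); n != -1; n = rec.read(buf)) {
        out.write(buf, 0, n);          // body only
    }
}
```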
|
From: Miguel C. <mig...@fc...> - 2008-03-08 11:01:26
|
Hi Brad, The problem is more like a bug than a future feature that should be implemented. You say "The Wayback will return the document that is closest to the current document being viewed.". This is the expected behavior, but this is only partially true. Look at this example. A version of a page with timestamp 20080228191941 is presented: http://t3.tomba.fccn.pt:8080/wayback/wayback/20080228191941/http://xldb.fc.ul.pt/daniel/ This page contains an embedded image with a later timestamp 20080228192041 (one minute more), but the Wayback submits a query to find the versions of this image from a minimum bound (20010101000000) to a maximum bound (timestamp of the source page = 20080228191941): http://t3.tomba.fccn.pt:8080/nutchwax/opensearch?query=date%3A20010101000000-20080228191941+exacturl%3Ahttp%3A%2F%2Fxldb.fc.ul.pt%2Fdaniel%2Fimages%2FretratoDanielGomes.jpg The desired version is excluded by the date range constraint in the query, since the image's timestamp is one minute later. The Wayback ONLY THEN computes the closest version from the versions with timestamp up to 20080228192041. To achieve the desired behavior and include the embedded image in the page using the closest date, the query should be bounded above by a broader date range, or alternatively the date range constraint should be removed. Currently in the code, the getRequestUrl() method on the NutchResourceIndex class receives a WaybackRequest parameter, containing 3 values: startDateStr (minimum bound) = 20010101000000 exactDateStr (URL timestamp) = 20080228191945 endDateStr (maximum bound = current time) = 20080305190212 These values are used to compute the date range constraint at line 287 in the NutchResourceIndex class: if ((endDateStr == null || endDateStr.length() == 0) && exactDateStr != null && exactDateStr.length() > 0) { ms.append("date%3A").append(exactDateStr).append('+'); } else { ms.append("date%3A").append(startDateStr).append('-').append( exactDateStr != null ? exactDateStr : endDateStr).append('+'); } This date range constraint is computed as "startDateStr-exactDateStr", but it should be "startDateStr-endDateStr" or "startDateStr-exactDateStr + xTime". Best regards, -- Miguel Costa -----Original Message----- From: arc...@li... [mailto:arc...@li...] On Behalf Of Brad Tofel Sent: Tuesday, 4 March 2008 23:13 To: Daniel Gomes Cc: archive-access discussion list Subject: Re: [Archive-access-discuss] url bounded by timestamp: explanation Hi Daniel, Thanks for the elaboration and the excellent suggestions. We've been discussing adding functionality to Wayback to allow users to target a specific date they want to stay "near" within a replay session. Currently when retrieving an embedded object for a web page, or when navigating between two archived web pages, the Wayback will return the document that is closest to the current document being viewed. We'd like to add the capability for users to specify a specific date, as well as a maximum range before and after that date to stay within for these embedded requests, and for navigations. In somewhat more detail, we plan to expand greatly the "in page presence" of the Wayback software, which in this particular case would mean including a banner or additional element in the page that would allow users to temporarily expand the maximum range of embedded elements in a specific page to potentially allow replay of captures that were archived, but are outside the standard maximum range. 
I think this is the same functionality you're suggesting, and we're hoping to have this in the 1.4 release, in a 2-3 month time frame. Wayback HEAD may include this functionality before that, and I'll let you know how that progresses. However, in the context you're using Wayback, with a Nutch ResourceIndex, this may require more functionality within Nutch as well. I'm not sure what the schedule might be for that, but again will keep you posted. Please let me know if I've misunderstood your suggestion, and the functionality we've discussed is not the same as your suggestions. Brad Daniel Gomes wrote: > The last email from my colleague Miguel Costa might have been a bit > confusing. I will try to clarify the problem we identified because it > looked quite important to us. > > We noticed that the wayback machine issues a query ranged by date to > find embedded objects, such as images in an HTML page. > > Our first question is "Why is the query ranged by date instead of > being restricted to the collection identifier?". > > A search by collection identifier would be more efficient because the > search would be based on an exact match of the collection id and would > present the images that most likely belong to that page. > One may argue that this way if the image was not crawled in the last > collection it would not be presented in the page. While using a date > range query the image would still be included. The problem we see in > this approach, is that we might be including images that although > exist in the archive, were never published in the page. > > This situation lead to our second doubt: > > The date range issued in the query is from a static date of the first > collection (e.g. 20010101000000) to the timestamp of the page (e.g. > 20080218201945). > > We believe this situation leads to several problems: > > 1. The date range of the query is unnecessarily broad, if we are > looking for the images embedded in a page crawled in 2008, looking for > them since 2001 seems excessive. > > 2. Pages can be presented containing old images that were never > published together (problem mentioned above) > > 3. Embedded images that have timestamps posterior to the page date > (even some minutes later) are not found and not rendered along with the page. > Notice, that pages must be crawled first to extract links to the > embedded images, so most images will have a date later than the page > and will not be presented by the wayback. In theory, it makes sense to > not present pages including contents "from the future", but > considering that crawls can not be executed instantly, using a sliding > time window seems to be more adequate to find embedded objects and even links to other pages. > > We propose that the wayback/nutchwax should be configurable to: > > 1. Find contents to be rendered together based on the collection id > or; > > 2.Find contents within a configurable date range centered on the date > of the page. Say if the page date is 2008/01/03, we would consider > that the embedded URLs crawled 3 days before and after this date could > be rendered along with it. Notice, that if one is performing a crawl > every > 3 months, the timespan could be 1 month instead of 3 days. The > timespan should be configured according to the duration and frequency of the crawls. > We believe that contents from previous crawls should not be rendered > together with a page. > > We would deeply appreciate that you validate our conclusions and gave > us feedback about this issue. 
> > Best regards, > /Daniel Gomes > Portuguese web archive > http://xldb.fc.ul.pt/daniel/ > > > *From:* arc...@li... > [mailto:arc...@li...] *On > Behalf Of *Miguel Costa > *Sent:* segunda-feira, 3 de Março de 2008 19:15 > *To:* arc...@li... > *Subject:* [Archive-access-discuss] url bounded by timestamp > > Hi, > > When a page is presented in the wayback machine, the linked images > (and other resources) are searched to be presented also. > The problem is that my wayback machine is searching using the nutchwax > index, through the opensearch servlet, and the nutchwax bounds the > search of the images (resources) by date (the timestamp of the source page): > > eg: date:20010101000000-20080218201945 > exacturl:http://xldb.fc.ul.pt/daniel/scripts/statCounter.js > > after the url be called inside the source page: > http://t3.tomba.fccn.pt:8080/wayback/wayback/20080218201945/http://xld > b.fc.u > l.pt/daniel/scripts/statCounter.js > > If the statCounter.js, for instance, has a higher timestamp (eg: > 20080218201955), that is usual, this resource is not found. > Does anyone know why these nutchwax searches don't use the collection > id instead the timestamp, to find the linked images (resources). Does > anyone know a solution for the problem? > > > Regards, > > > > -- Miguel Costa > > Portuguese Web Archive > > > -- > /Daniel Gomes > FCCN > Av. do Brasil, n.º 101 > 1700-066 Lisboa > Tel.: +351 21 8440190 > Fax: +351 218472167 > www.fccn.pt > > Aviso de Confidencialidade > > Esta mensagem é exclusivamente destinada ao seu destinatário, podendo > conter informação CONFIDENCIAL, cuja divulgação está expressamente > vedada nos termos da lei. Caso tenha recepcionado indevidamente esta > mensagem, solicitamos-lhe que nos comunique esse mesmo facto por esta > via ou para o telefone +351 218440100 devendo apagar o seu conteúdo de > imediato. This message is intended exclusively for its addressee. It > may contain CONFIDENTIAL information protected by law. If this message > has been received by error, please notify us via e-mail or by > telephone +351 218440100 and delete it immediately. > > > > > ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ _______________________________________________ Archive-access-discuss mailing list Arc...@li... https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
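The one-line change Miguel proposes can be read directly off the fragment he quotes; a sketch follows, with variable names taken from the quoted code (ms, startDateStr, exactDateStr, endDateStr), and with the padded alternative ("exactDateStr + xTime") left as a comment since the thread does not fix the padding amount:

```java
// Sketch of the fix proposed above for the quoted NutchResourceIndex
// fragment: bound the query above by endDateStr (current time) rather
// than by the page's own timestamp, so embedded objects captured a few
// minutes after the page still match.
if ((endDateStr == null || endDateStr.length() == 0)
        && exactDateStr != null && exactDateStr.length() > 0) {
    ms.append("date%3A").append(exactDateStr).append('+');
} else {
    ms.append("date%3A").append(startDateStr).append('-')
            // was: exactDateStr != null ? exactDateStr : endDateStr
            .append(endDateStr)
            .append('+');
    // Alternative from the thread: keep exactDateStr as the bound but
    // pad it by some window ("exactDateStr + xTime"); the amount is
    // left open in the discussion.
}
```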
|
From: Arturas S. <Art...@si...> - 2008-03-06 12:12:18
|
Hello, we are using NetarchiveSuite + Wayback-1.2.0 + Nutchwax in the National Library of Lithuania. Our archive can be accessed at http://eia.libis.lt:8080/archyvas/viesas/ Most pages are displayed correctly, but on some pages we have problems with wayback. The links http://eia.libis.lt:8080/archyvas/viesas/20080208070802/http://www.arzinai.lt/, http://eia.libis.lt:8080/archyvas/viesas/20080216094637/http://www.valstietis.lt/ display the error HTTP Status 500 - type Exception report message description The server encountered an internal error () that prevented it from fulfilling this request. exception java.lang.StringIndexOutOfBoundsException: String index out of range: -3 java.lang.String.substring(String.java:1768) org.archive.wayback.replay.TagMagix.markupStyleUrls(TagMagix.java:176) ..... Another problem is that old data harvested with the Nedlib harvester and converted to arc.gz format with the Nedlib2Arc tool is not always displayed correctly. When searching www.lzinios.lt, results from 2008 are displayed correctly, but links from 2002 don't work. Also, when using server-side rendering, pages with framesets can't be displayed. One more feature would be great: if wayback could access and index files from resourceStore dataDir subdirectories, because we will have very many arc files (>10000) and it is a problem to put symbolic links to them in one directory. Regards, Artūras Sagidulinas |
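For readers chasing the same stack trace: a negative "String index out of range" is the classic signature of an indexOf() miss feeding substring(). An illustrative reproduction of the failure class (not Wayback's actual TagMagix code) and the kind of guard that avoids it:

```java
// Illustrative only: when indexOf() misses it returns -1, and passing a
// value derived from it to substring() throws
// StringIndexOutOfBoundsException with a negative "range", like the -3
// reported above.
String css = "background: url(broken";   // unterminated url(...) in CSS
int start = css.indexOf("url(") + 4;
int end = css.indexOf(')', start);        // -1: no closing parenthesis
String url = (end >= start)
        ? css.substring(start, end)
        : css.substring(start);           // guarded fallback, no exception
```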
|
From: Daniel G. <dan...@fc...> - 2008-03-05 14:37:07
|
Hi everyone. Heritrix and the Wayback machine already support ARC files. However, we haven't found any documentation about the capability of NutchWax to index wARC files. Is this supported? Thank you for your attention. Best regards, -- /Daniel Gomes FCCN Av. do Brasil, n.º 101 1700-066 Lisboa Tel.: +351 21 8440190 Fax: +351 218472167 www.fccn.pt Aviso de Confidencialidade Esta mensagem é exclusivamente destinada ao seu destinatário, podendo conter informação CONFIDENCIAL, cuja divulgação está expressamente vedada nos termos da lei. Caso tenha recepcionado indevidamente esta mensagem, solicitamos-lhe que nos comunique esse mesmo facto por esta via ou para o telefone +351 218440100 devendo apagar o seu conteúdo de imediato. This message is intended exclusively for its addressee. It may contain CONFIDENTIAL information protected by law. If this message has been received by error, please notify us via e-mail or by telephone +351 218440100 and delete it immediately. |
|
From: Brad T. <br...@ar...> - 2008-03-04 23:09:52
|
Hi Daniel, Thanks for the elaboration and the excellent suggestions. We've been discussing adding functionality to Wayback to allow users to target a specific date they want to stay "near" within a replay session. Currently when retrieving an embedded object for a web page, or when navigating between two archived web pages, the Wayback will return the document that is closest to the current document being viewed. We'd like to add the capability for users to specify a specific date, as well as a maximum range before and after that date to stay within for these embedded requests, and for navigations. In somewhat more detail, we plan to expand greatly the "in page presence" of the Wayback software, which in this particular case would mean including a banner or additional element in the page that would allow users to temporarily expand the maximum range of embedded elements in a specific page to potentially allow replay of captures that were archived, but are outside the standard maximum range. I think this is the same functionality you're suggesting, and we're hoping to have this in the 1.4 release, in a 2-3 month time frame. Wayback HEAD may include this functionality before that, and I'll let you know how that progresses. However, in the context you're using Wayback, with a Nutch ResourceIndex, this may require more functionality within Nutch as well. I'm not sure what the schedule might be for that, but again will keep you posted. Please let me know if I've misunderstood your suggestion and the functionality we've discussed is not the same as what you're suggesting. Brad Daniel Gomes wrote: > The last email from my colleague Miguel Costa might have been a bit > confusing. I will try to clarify the problem we identified because it looked > quite important to us. > > We noticed that the wayback machine issues a query ranged by date to find > embedded objects, such as images in an HTML page. > > Our first question is "Why is the query ranged by date instead of being > restricted to the collection identifier?". > > A search by collection identifier would be more efficient because the search > would be based on an exact match of the collection id and would present the images that most > likely belong to that page. > One may argue that this way if the image was not crawled in the last > collection it would not be presented in the page. While using a date range > query the image would still be included. The problem we see in this approach, is that > we might be including images that although exist in the archive, were never > published in the page. > > This situation lead to our second doubt: > > The date range issued in the query is from a static date of the first > collection (e.g. 20010101000000) to the timestamp of the page (e.g. > 20080218201945). > > We believe this situation leads to several problems: > > 1. The date range of the query is unnecessarily broad, if we are looking for > the images embedded in a page crawled in 2008, looking for them since 2001 > seems excessive. > > 2. Pages can be presented containing old images that were never published > together (problem mentioned above) > > 3. Embedded images that have timestamps posterior to the page date (even > some minutes later) are not found and not rendered along with the page. > Notice, that pages must be crawled first to extract links to the embedded > images, so most images will have a date later than the page and will not be > presented by the wayback. 
In theory, it makes sense to not present pages including contents "from the > future", but considering that crawls can not be executed instantly, using a > sliding time window seems to be more adequate to find embedded objects and > even links to other pages. > > We propose that the wayback/nutchwax should be configurable to: > > 1. Find contents to be rendered together based on the collection id or; > > 2.Find contents within a configurable date range centered on the date of the > page. Say if the page date is 2008/01/03, we would consider that the > embedded URLs crawled 3 days before and after this date could be rendered > along with it. Notice, that if one is performing a crawl every > 3 months, the timespan could be 1 month instead of 3 days. The timespan > should be configured according to the duration and frequency of the crawls. > We believe that contents from previous crawls should not be rendered > together with a page. > > We would deeply appreciate that you validate our conclusions and gave us > feedback about this issue. > > Best regards, > /Daniel Gomes > Portuguese web archive > http://xldb.fc.ul.pt/daniel/ > > > *From:* arc...@li... > [mailto:arc...@li...] *On Behalf Of > *Miguel Costa > *Sent:* segunda-feira, 3 de Março de 2008 19:15 > *To:* arc...@li... > *Subject:* [Archive-access-discuss] url bounded by timestamp > > Hi, > > When a page is presented in the wayback machine, the linked images (and > other resources) are searched to be presented also. > The problem is that my wayback machine is searching using the nutchwax > index, through the opensearch servlet, and the nutchwax bounds the search of > the images (resources) by date (the timestamp of the source page): > > eg: date:20010101000000-20080218201945 > exacturl:http://xldb.fc.ul.pt/daniel/scripts/statCounter.js > > after the url be called inside the source page: > http://t3.tomba.fccn.pt:8080/wayback/wayback/20080218201945/http://xldb.fc.u > l.pt/daniel/scripts/statCounter.js > > If the statCounter.js, for instance, has a higher timestamp (eg: > 20080218201955), that is usual, this resource is not found. > Does anyone know why these nutchwax searches don't use the collection id > instead the timestamp, to find the linked images (resources). Does anyone > know a solution for the problem? > > > Regards, > > > > -- Miguel Costa > > Portuguese Web Archive > > > -- > /Daniel Gomes > FCCN > Av. do Brasil, n.º 101 > 1700-066 Lisboa > Tel.: +351 21 8440190 > Fax: +351 218472167 > www.fccn.pt > > Aviso de Confidencialidade > > Esta mensagem é exclusivamente destinada ao seu destinatário, podendo conter > informação CONFIDENCIAL, cuja divulgação está expressamente vedada nos > termos da lei. Caso tenha recepcionado indevidamente esta mensagem, > solicitamos-lhe que nos comunique esse mesmo facto por esta via ou para o > telefone +351 218440100 devendo apagar o seu conteúdo de imediato. This > message is intended exclusively for its addressee. It may contain > CONFIDENTIAL information protected by law. If this message has been received > by error, please notify us via e-mail or by telephone +351 218440100 and > delete it immediately. > > > > > |
|
From: Brad T. <br...@ar...> - 2008-03-04 21:43:24
|
Hi Arnaud, This is a good question, which has come up several times in the past month or so, which hopefully means we'll be addressing it better in the 1.4 release, 2-3 month time frame. You do need to put all the ARC/WARC files in a single directory to serve them all from a single collection. This can be accomplished by copying/moving them to a single directory, or by using symbolic links. As another, more complex, alternative you could set up an ARC Proxy and expose all the ARCs in their various directories via HTTP 1.1, possibly just for access to the local machine, which would then be configured to use a RemoteResourceStore. The primary downside to doing this is that you need to manage updating your index yourself. Or, possibly better, would be to create multiple collections, each with a distinct LocalResourceStore pointing at the correct directory containing the appropriate ARC files. This would also require creating a separate ResourceIndex for each collection, and a separate AccessPoint for each of those collections. IA and other users have been doing this extensively in our deployments, but the down side is that you won't be able to search across all collections with a single query, but that may be what you want. So, the simplest is to move the files, or use symbolic links, but there are other options that can accomplish this. Brad Arn...@he... wrote: > Hello, > sorry certainly for this stupid question but after spending time into > heritrix manuals, wayback manuals and mailing list archive , I hope > users of this mailing can help me! > 1/ In heritrix Arcs files are created in 'arcs' directories under each > different Job. So several directories. > 2/ In 'wayback.xml' I have to define the 'dataDir'. So one directory. > How to organized my arcs files in one directory to be used by the > wayback machine? > Do I need to regularly copy the arcs files in a specific directory? > > Currently I tested by setting dataDir to one of my job arcs directory > but I obtain this error message when I hit the 'take me back' button > > > Etat HTTP 404 - /wayback-webapp-1.2.0/query > > I haven't error messages in tomcat log file. > > Arnaud. > > > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Microsoft > Defy all challenges. Microsoft(R) Visual Studio 2008. > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > ------------------------------------------------------------------------ > > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > |
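To make the "multiple collections" alternative concrete, a hypothetical sketch of the per-job layout in wayback.xml. LocalResourceStore and dataDir are both named in this thread, but the bean id and full package paths here are placeholders patterned on the other beans quoted on this list, not copied from a shipped configuration; check them against the wayback.xml of your release:

```xml
<!-- Hypothetical sketch of "one collection per crawl-job directory".
     Bean id and package paths are illustrative placeholders. -->
<bean id="job1Collection" class="org.archive.wayback.webapp.WaybackCollection">
  <property name="resourceStore">
    <bean class="org.archive.wayback.resourcestore.LocalResourceStore">
      <!-- one crawl job's arcs directory; repeat the pattern per job -->
      <property name="dataDir" value="/heritrix/jobs/job1/arcs" />
    </bean>
  </property>
  <!-- plus a separate resourceIndex and AccessPoint per collection,
       as Brad describes above -->
</bean>
```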
|
From: Daniel G. <dan...@fc...> - 2008-03-04 17:55:13
|
The last email from my colleague Miguel Costa might have been a bit confusing. I will try to clarify the problem we identified because it looked quite important to us. We noticed that the wayback machine issues a query ranged by date to find embedded objects, such as images in an HTML page. Our first question is "Why is the query ranged by date instead of being restricted to the collection identifier?". A search by collection identifier would be more efficient because the search would be based on an exact match of the collection id and would present the images that most likely belong to that page. One may argue that this way, if the image was not crawled in the last collection, it would not be presented in the page. While using a date range query the image would still be included. The problem we see in this approach is that we might be including images that, although they exist in the archive, were never published in the page. This situation leads to our second doubt: The date range issued in the query is from a static date of the first collection (e.g. 20010101000000) to the timestamp of the page (e.g. 20080218201945). We believe this situation leads to several problems: 1. The date range of the query is unnecessarily broad: if we are looking for the images embedded in a page crawled in 2008, looking for them since 2001 seems excessive. 2. Pages can be presented containing old images that were never published together (the problem mentioned above). 3. Embedded images that have timestamps later than the page date (even by some minutes) are not found and not rendered along with the page. Notice that pages must be crawled first to extract links to the embedded images, so most images will have a date later than the page and will not be presented by the wayback. In theory, it makes sense not to present pages including contents "from the future", but considering that crawls cannot be executed instantly, using a sliding time window seems to be more adequate to find embedded objects and even links to other pages. We propose that the wayback/nutchwax should be configurable to: 1. Find contents to be rendered together based on the collection id; or 2. Find contents within a configurable date range centered on the date of the page (a sketch of this windowing follows this message). Say the page date is 2008/01/03: we would consider that the embedded URLs crawled 3 days before and after this date could be rendered along with it. Notice that if one is performing a crawl every 3 months, the timespan could be 1 month instead of 3 days. The timespan should be configured according to the duration and frequency of the crawls. We believe that contents from previous crawls should not be rendered together with a page. We would deeply appreciate it if you would validate our conclusions and give us feedback about this issue. Best regards, /Daniel Gomes Portuguese web archive http://xldb.fc.ul.pt/daniel/ From: arc...@li... [mailto:arc...@li...] On Behalf Of Miguel Costa Sent: Monday, 3 March 2008 19:15 To: arc...@li... Subject: [Archive-access-discuss] url bounded by timestamp Hi, When a page is presented in the wayback machine, the linked images (and other resources) are also searched so that they can be presented. 
The problem is that my wayback machine is searching using the nutchwax index, through the opensearch servlet, and the nutchwax bounds the search of the images (resources) by date (the timestamp of the source page): eg: date:20010101000000-20080218201945 exacturl:http://xldb.fc.ul.pt/daniel/scripts/statCounter.js after the URL is called inside the source page: http://t3.tomba.fccn.pt:8080/wayback/wayback/20080218201945/http://xldb.fc.ul.pt/daniel/scripts/statCounter.js If the statCounter.js, for instance, has a higher timestamp (e.g. 20080218201955), which is usual, this resource is not found. Does anyone know why these nutchwax searches don't use the collection id instead of the timestamp to find the linked images (resources)? Does anyone know a solution for the problem? Regards, -- Miguel Costa Portuguese Web Archive -- /Daniel Gomes FCCN Av. do Brasil, n.º 101 1700-066 Lisboa Tel.: +351 21 8440190 Fax: +351 218472167 www.fccn.pt Aviso de Confidencialidade Esta mensagem é exclusivamente destinada ao seu destinatário, podendo conter informação CONFIDENCIAL, cuja divulgação está expressamente vedada nos termos da lei. Caso tenha recepcionado indevidamente esta mensagem, solicitamos-lhe que nos comunique esse mesmo facto por esta via ou para o telefone +351 218440100 devendo apagar o seu conteúdo de imediato. This message is intended exclusively for its addressee. It may contain CONFIDENTIAL information protected by law. If this message has been received by error, please notify us via e-mail or by telephone +351 218440100 and delete it immediately. |
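Daniel's second proposal maps naturally onto Wayback's 14-digit timestamps; a self-contained sketch follows (the class and method names and the window size are illustrative, and the thread leaves the span configurable):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

// Sketch of proposal 2 above: a query window centered on the page's
// capture date, e.g. +/- 3 days for frequent crawls, wider for rare
// ones. Names are illustrative, not from the Wayback source.
public class SlidingWindow {
    private static final String FMT = "yyyyMMddHHmmss"; // 14-digit stamps

    static String[] window(String pageTimestamp, int days)
            throws ParseException {
        SimpleDateFormat f = new SimpleDateFormat(FMT);
        f.setTimeZone(TimeZone.getTimeZone("GMT"));
        Calendar c = Calendar.getInstance(TimeZone.getTimeZone("GMT"));
        c.setTime(f.parse(pageTimestamp));
        c.add(Calendar.DAY_OF_MONTH, -days);
        String from = f.format(c.getTime());
        c.add(Calendar.DAY_OF_MONTH, 2 * days);
        String to = f.format(c.getTime());
        return new String[] { from, to };
    }

    public static void main(String[] args) throws ParseException {
        String[] w = window("20080103120000", 3); // page from 2008/01/03
        System.out.println("date:" + w[0] + "-" + w[1]);
        // prints date:20071231120000-20080106120000
    }
}
```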
|
From: Miguel C. <mig...@fc...> - 2008-03-03 19:15:32
|
Hi, When a page is presented in the wayback machine, the linked images (and other resources) are also searched so that they can be presented. The problem is that my wayback machine is searching using the nutchwax index, through the opensearch servlet, and the nutchwax bounds the search of the images (resources) by date (the timestamp of the source page): eg: date:20010101000000-20080218201945 exacturl:http://xldb.fc.ul.pt/daniel/scripts/statCounter.js after the URL is called inside the source page: http://t3.tomba.fccn.pt:8080/wayback/wayback/20080218201945/http://xldb.fc.ul.pt/daniel/scripts/statCounter.js If the statCounter.js, for instance, has a higher timestamp (e.g. 20080218201955), which is usual, this resource is not found. Does anyone know why these nutchwax searches don't use the collection id instead of the timestamp to find the linked images (resources)? Does anyone know a solution for the problem? Regards, -- Miguel Costa Portuguese Web Archive |
|
From: Miguel C. <mig...@fc...> - 2008-03-03 16:05:35
|
Hi,
I found the bug in the ImportArcs class. This bug makes the import command
build segments with wrong arc names.
The map method receives a "value" parameter containing an ARCRecord.
This ARCRecord has the url, arc filename and offset. All values are used in
this method, except the arc filename, which is set only the first time the map
method is called. So, when a thread is working over a new arc file, the
output for the index will reference the old arc filename.
The bug occurs at line 301 ("checkArcName(rec);"). I commented out line 545 of
the checkArcName() method to fix the bug (a sketch of the corresponding fix
follows at the end of this message).
Regards,
-- Miguel Costa
Portuguese Web Archive
-----Original Message-----
From: Miguel Costa [mailto:mig...@fc...]
Sent: Thursday, 28 February 2008 11:44
To: 'Brad Tofel'
Cc: 'Daniel Gomes'
Subject: FW: [Archive-access-discuss] org.archive.io.NoGzipMagicException
Hi,
I don't know if you found anything else about this problem, but I found the
cause of the problem.
The index has bad references for the ARC files. The offsets returned are ok
but not the ARC files, usually one ARC filename behind:
e.g. returns IAH-20080218190013-00000-T4.arc.gz instead of
IAH-20080218190013-00001-T4.arc.gz
You can see the ARC file and offset debugging the NutchResourceIndex (line
122: document = getHttpDocument(requestUrl)) or, much more simply, by
submitting the url in the browser
e.g.
http://localhost:8080/nutchwax/opensearch?query=date%3A20010101000000-20080218190351+exacturl%3Ahttp%3A%2F%2Fwww.icat.fc.ul.pt%2Fstyles.css&hitsPerPage=1000&start=0&dedupField=site&hitsPerDup=1000&hitsPerSite=1000
The wayback machine is using the nutchwax index through the opensearch link.
The nutchwax sends the XML information for the url match. This shows an ARC file
and an offset, but if you use the ARC READER over all ARCs to find this
offset:
e.g. find offset 24042995
arcreader `ls *.arc.gz` | grep 24042995
20080218190054 194.117.42.131 http://www.icat.fc.ul.pt/images/background_voltar.gif image/gif - - 24042995 388 IAH-20080218190013-00002-T4
you get an ARC file different from the expected one. In the cases where the
NoGzipMagicException doesn't occur, the ARC file is the correct one.
This occurs with one or more reduce tasks in hadoop, so it doesn't seem to be a
problem with the merge command.
Do you have any idea to solve this?
Regards,
-----Original Message-----
From: Miguel Costa [mailto:mig...@fc...]
Sent: Friday, 22 February 2008 15:22
To: 'Brad Tofel'
Subject: RE: [Archive-access-discuss] org.archive.io.NoGzipMagicException
Some more information on the problem.
I'm debugging the code using the org.archive.io.arc.ARCReader class from the
command line.
I can parse and dump all URLs from the arc.gz file.
When I use the offset returned by this dump I can see that the file is OK.
e.g.
/home/nutchwax/heritrix-1.12.1/src/scripts/arcreader -o 2332619 /home/nutchwax/arcs/IAH-20080123100910-00023-thessalian.arc.gz
20080123110842 70.85.38.82 http://www.gastronomias.com/moirasencantadas/imagens/logo.jpg image/jpeg - 4AN5CYCD3OYOZH7ZMOMJEKR37NTTXLT6 2332629 7722 IAH-20080123100910-00023-thessalian
When I put another offset I get the same exception:
Exception in thread "main" java.io.IOException: Not in GZIP format
at
java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
So, the problem seems to be in the creation of the index because the offsets
are computed incorrectly.
-----Original Message-----
From: Brad Tofel [mailto:br...@ar...]
Sent: Thursday, 21 February 2008 19:52
To: Miguel Costa
Subject: Re: [Archive-access-discuss] org.archive.io.NoGzipMagicException
Darn. The problem didn't surface given the small input you sent.. Ran into
"Unepected End of ZLIB input stream" before the problem you are seeing.
Is there someplace online where you can post the entire file so I can
download it and examine it?
I should be able to receive a 100MB (how large is the original?) attachment
as well, if sending the whole file via email is an option for you.
Thanks,
Brad
Miguel Costa wrote:
|
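Miguel's description implies a per-record fix along these lines; the field name and the metadata accessor are illustrative reconstructions (shown per the Heritrix 1.x ARCRecordMetaData API as best recalled; verify against your version), since only the two line numbers (301 and 545) are quoted from ImportArcs:

```java
// Hedged reconstruction of the ImportArcs fix described above: keep the
// cached ARC name in sync on every map() call instead of pinning it to
// the first record seen. This is a fragment of the mapper class, not a
// complete file.
private String arcName; // cached name used when emitting index entries

private void checkArcName(ARCRecord rec) {
    String current = rec.getMetaData().getArc();
    if (arcName == null || !arcName.equals(current)) {
        // Updating on change keeps offsets paired with the right file
        // when a task crosses an ARC boundary (the reported symptom was
        // entries pointing one ARC filename behind).
        arcName = current;
    }
}
```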
|
From: Brad T. <br...@ar...> - 2008-03-01 02:08:51
|
Hi Thomas, Thanks for the kind feedback. A couple of suggestions, and also some follow-up questions interspersed: Thomas Beekman wrote: > Hi all, > > At the KB we are severely testing Wayback 1.2.0 at the moment. My first > impression is quite positive; many new functions are added, it is quite > easy to implement different modules for different access points and > several indexing threads can live side by side now. > > I have a few questions though. First of all, I'm experiencing errors > which did not occur in older versions; java.lang.OutOfMemoryError: GC > overhead limit exceeded. Does anyone know how to fix this? > > I haven't seen this before, and some quick google searches indicate it may be one of: A) a JVM problem (which JVM are you using?) B) too little heap space in the java startup arguments C) the wayback software doing lots of object creation+destruction. We have large installations in production at the IA, one using 700+ Collections and 1400+ AccessPoints; note that these all use CDX indexes, which are more resource efficient. I'm hoping that C is not the problem, but we haven't yet needed to do a heavy optimization pass over the code, so it could be Wayback itself. Are you using IBM's JVM? Have you tried increasing the heap? If that doesn't address the problem, can you please send me a copy of your wayback.xml Spring configuration? > Second; when closing down Wayback in Tomcat, the lock file for the > localbdb is not erased. A restart is therefore not possible. Could this > be fixed so that if the webapp is closed down, the lock file is erased? > > On what platform (OS+JVM) are you running Wayback? Is the BDB index stored over NFS or another networked file system? I haven't experienced this problem on any of our systems -- the BDBJE just starts up, even with the lock file still existing. I haven't looked into this, but guessed that it was using the lock file via flock() type semantics, instead of using its existence to indicate a lock. BDBJE may determine that the DB is on a remote system, where flock() semantics don't work, in which case it may be falling back to using the existence of the lock file to indicate usage.. In any case, I've just implemented the "clean shutdown" processing in my development environment, but will probably hold off to do more testing before including it in a release. We are preparing a 1.2.1 release which addresses a couple bugs discovered by folks in the field, but are holding this release for feedback from one more user having trouble reading some ARC files. > Third; with a few websites the timeline GUI is scrambled. I get a full > yellow screen with on every line a mark. After scrolling down that page, > the website is presented normally. This is not the case with every > website. > > Yes, the css implementation in the current timeline is prone to inheriting some styles from some web pages. Could you please send me a few example pages on the live web that demonstrate the problem you're seeing? > My fourth and last problem is in the configuration. I would like to do > some tests using the remote NutchWAX search, but there is not a clear > manual of how to implement this precisely, which beans to use for > example. Does anyone have a good example for me? 
> > Setting up a collection with this bean: <property name="resourceIndex"> <bean class="org.archive.wayback.resourceindex.NutchResourceIndex" init-method="init"> <property name="searchUrlBase" value="http://webteam-ws.us.archive.org:8080/katrina/opensearch" /> <property name="maxRecords" value="100" /> </bean> </property> Should do the trick. Note that if using Archival URL mode, you should be sure to set the maxRecords property on the RequestParser to the same value for maxRecords.. This may be a bug -- would be more friendly to use the min() of both values.. <property name="parser"> <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser" init-method="init"> <property name="maxRecords" value="100" /> <property name="earliestTimestamp" value="1996" /> </bean> </property> Hopefully this works for you, and please let me know about the questions above. Brad |
|
From: Miguel C. <mig...@fc...> - 2008-02-28 17:49:58
|
Hi, I would like to know the ratio between (index size)/(collection size) for collections larger than 1 TB. My objective is to have the whole index in memory, so given X GB of memory, what is the maximum size of a collection I can index? Can anyone give me some numbers? Regards, -- Miguel Costa FCCN-Fundação para a Computação Científica Nacional Av. do Brasil, n.º 101 1700-066 Lisboa Tel.: +351 21 8440190 Fax: +351 218472167 www.fccn.pt Aviso de Confidencialidade Esta mensagem é exclusivamente destinada ao seu destinatário, podendo conter informação CONFIDENCIAL, cuja divulgação está expressamente vedada nos termos da lei. Caso tenha recepcionado indevidamente esta mensagem, solicitamos-lhe que nos comunique esse mesmo facto por esta via ou para o telefone +351 218440100 devendo apagar o seu conteúdo de imediato. This message is intended exclusively for its addressee. It may contain CONFIDENTIAL information protected by law. If this message has been received by error, please notify us via e-mail or by telephone +351 218440100 and delete it immediately. |
|
From: <Arn...@he...> - 2008-02-21 12:43:03
|
Hello, sorry for what is certainly a stupid question, but after spending time in the heritrix manuals, the wayback manuals and the mailing list archive, I hope users of this mailing list can help me! 1/ In heritrix, Arc files are created in 'arcs' directories under each different Job. So several directories. 2/ In 'wayback.xml' I have to define the 'dataDir'. So one directory. How should I organize my arc files into one directory to be used by the wayback machine? Do I need to regularly copy the arc files into a specific directory? Currently I tested by setting dataDir to one of my job arcs directories, but I obtain this error message when I hit the 'take me back' button Etat HTTP 404 - /wayback-webapp-1.2.0/query I have no error messages in the tomcat log file. Arnaud. |
|
From: Thomas B. <Tho...@KB...> - 2008-02-21 11:48:41
|
Hi all, At the KB we are testing Wayback 1.2.0 intensively at the moment. My first impression is quite positive; many new functions have been added, it is quite easy to implement different modules for different access points, and several indexing threads can live side by side now. I have a few questions though. First of all, I'm experiencing errors which did not occur in older versions: java.lang.OutOfMemoryError: GC overhead limit exceeded. Does anyone know how to fix this? Second; when closing down Wayback in Tomcat, the lock file for the localbdb is not erased. A restart is therefore not possible. Could this be fixed so that if the webapp is closed down, the lock file is erased? Third; with a few websites the timeline GUI is scrambled. I get a full yellow screen with a mark on every line. After scrolling down that page, the website is presented normally. This is not the case with every website. My fourth and last problem is in the configuration. I would like to do some tests using the remote NutchWAX search, but there is not a clear manual of how to implement this precisely, which beans to use for example. Does anyone have a good example for me? Keep up the good work! Wayback is really becoming a beautiful piece of software. Cheers, Thomas Beekman Technical Lead KB (National Library of the Netherlands) |
|
From: Miguel C. <mig...@fc...> - 2008-02-18 16:26:04
|
Hi, Does anyone know how to split a segment into N sub-segments? With the "org.apache.nutch.segment.SegmentMerger -split" command I can split a segment by number of URLs, but how do I split it into N parts? Regards, -- Miguel Costa |
|
From: Miguel C. <mig...@fc...> - 2008-02-15 15:36:15
|
Hi Lee, Thanks for your reply. I have two doubts about your response: 1- After I deploy on 10 (n) machines, should I index each subset locally, in parallel on the 10 machines, or in a distributed way (indexing the 10 subsets sequentially)? 2- If I split the ARCs, will the ranking values use local statistics from the ARC subset or global statistics from the whole collection and web graph? If local, the ranking will not be normalized between subsets; if global, when are these values merged? At runtime, during query responses? Regards, _____ From: arc...@li... [mailto:arc...@li...] On Behalf Of John H. Lee Sent: Tuesday, 5 February 2008 20:30 To: arc...@li... Subject: Re: [Archive-access-discuss] how to partition the index? Hi Miguel. To use distributed search, you need to plan ahead a bit and generate multiple indices. I don't know of a way to partition an existing large index into smaller chunks. For example, if you're indexing 100,000 ARCs and want to deploy on 10 machines, you should split your list of ARCs into 10 chunks of 10,000, invoke ImportArcs for each chunk, and invoke NutchwaxIndexer for each chunk. This will produce 10 segment/index pairs, each of which could be deployed on one of your 10 machines. For large jobs, I usually split the ARCs into groups of 1000. This produces segment/index pairs that are small enough to be manageable and flexible when it comes to deployment layout. Hope this helps. -J On Feb 5, 2008, at 5:12 AM, Miguel Costa wrote: Hi to all, After reading the nutchwax + nutch documentation I can index ARC files and search them using the nutchwax + wayback machine. However, I would like to perform a distributed search, but I can't find any documentation on how to partition the index in n parts/segments for n machines. On the other hand there is information explaining how to distribute search using the search-servers.txt file, but I need to partition the index first. Can anyone explain to me or give me a clue on how to partition an index for n machines? Regards, Miguel Costa ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/_____________________ __________________________ Archive-access-discuss mailing list Arc...@li... https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
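The chunking step John describes is just a fixed-size partition of the ARC list; a tiny sketch of the planning code (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the step John describes: split the ARC list into fixed-size
// chunks, then run ImportArcs and NutchwaxIndexer once per chunk to get
// one segment/index pair per search machine.
public class ArcChunker {
    static List<List<String>> chunk(List<String> arcs, int size) {
        List<List<String>> chunks = new ArrayList<List<String>>();
        for (int i = 0; i < arcs.size(); i += size) {
            chunks.add(new ArrayList<String>(
                    arcs.subList(i, Math.min(i + size, arcs.size()))));
        }
        return chunks; // e.g. chunk(arcList, 1000) per the advice above
    }
}
```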
|
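John's recipe maps directly onto a small shell loop. A sketch, assuming arcs.txt lists one ARC file per line; run_import and run_index are hypothetical placeholders for whatever ImportArcs and NutchwaxIndexer invocations your NutchWAX version provides:

    # Split the ARC list into chunks of 10,000 (chunk_aa, chunk_ab, ...).
    split -l 10000 arcs.txt chunk_
    # Build one segment/index pair per chunk; each pair can then be
    # deployed to one search machine.
    for chunk in chunk_*; do
        run_import "$chunk" "segments/$chunk"           # ImportArcs step
        run_index "segments/$chunk" "indexes/$chunk"    # NutchwaxIndexer step
    done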
From: Oskar G. <Osk...@kb...> - 2008-02-15 11:03:42
Hi!

I've downloaded WB 1.2.0 (which is excellent, by the way) and got it working right away with ARC files. But when I later turned my attention to WARC files (downloaded with Heritrix 1.12.1), I couldn't get it to work at first. When using "warc-indexer" to create CDX files, it just spat out a NullPointerException and crashed. After some browsing through the code and testing, I found that the cause was that the method getUrl() in ArchiveRecordHeader in the jar file commons-2.0.1-SNAPSHOT.jar ALWAYS returned null. Why that is I haven't looked into, but it caused line 300 in WARCRecordToSearchResultAdapter.java ( String origHost = uriStr.substring(WaybackConstants.DNS_URL_PREFIX.length()); ) to throw the NPE. The solution for me was to replace "commons-2.0.1-SNAPSHOT.jar" with "heritrix.1.12.1.jar", and then it worked fine.

Best regards,
Oskar
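Oskar's workaround boils down to swapping one jar in the deployed webapp. A sketch with illustrative paths (your Tomcat home, webapp name, and Heritrix install location will differ); stop Tomcat before making the change:

    cd $CATALINA_HOME/webapps/wayback/WEB-INF/lib
    # Take the broken jar out of the classpath; Tomcat only loads *.jar.
    mv commons-2.0.1-SNAPSHOT.jar commons-2.0.1-SNAPSHOT.jar.disabled
    # Drop in the Heritrix 1.12.1 jar described above.
    cp /path/to/heritrix.1.12.1.jar .
    # Restart Tomcat so the webapp picks up the replacement.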
From: stack <st...@du...> - 2008-02-08 16:51:13
Pope, Jackson wrote:
> Hiya all,
>
> I've created a lot of nutchwax indices, deployed the segments and index
> for each to the search directory, and got nutchwax/wayback to search
> these successfully.
>
> However, when I try to add more than 40 I hit the 'too many open files'
> problem I mentioned before. Several people have suggested upping the
> 'ulimit' to 32768, but I've already got it set to 1024, so upping it to
> 32768 would theoretically allow me to create 30 x 40 indices, still an
> order of magnitude smaller than I need.

Regarding 1024, do the math. Each index is made of, say, 20 files (do a listing of an index to know for sure). 40 * 20 = 800, not counting the other files the application needs to open (jar files, configuration files, etc.). As you can see, 1024 probably ain't enough.

Searching many indices is slower than searching a single index. That's another reason to do merging.

> Next step I've tried is index merging.
>
> I've run the IndexMerger over some of my indices successfully, but when
> I replace the indexes directory (which contains the individual indices)
> with the new index, nutchwax stops working. It tells me that it's found
> some hits for my search term, but it doesn't list them, and wayback
> claims the index is unavailable. What else do I need to do to deploy a
> merged index?

Any exceptions in the Tomcat log? Or, looking at the logging, is it looking in the right place for the index? You might need to add an empty index.done file to the merged index if it's not there already (see the end of this FAQ: http://archive-access.sourceforge.net/projects/nutchwax/faq.html#incremental) -- but I'm fuzzy on this stuff, so that might not be it.

St.Ack
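Both suggestions above are quick to act on from a shell; the index directory names below (indexes/index-0001, indexes/merged) are illustrative:

    # Estimate descriptors needed: files per index * number of indices,
    # plus jars, configuration files, sockets, etc.
    ls indexes/index-0001 | wc -l

    # Raise the open-file limit in the shell that starts Tomcat
    # (raising the hard limit may require root).
    ulimit -n 32768

    # Per the FAQ referenced above, mark the merged index as complete
    # with an empty index.done file.
    touch indexes/merged/index.done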
From: Pope, J. <Jac...@bl...> - 2008-02-08 11:38:04
Hiya all,

I've created a lot of nutchwax indices, deployed the segments and index for each to the search directory, and got nutchwax/wayback to search these successfully.

However, when I try to add more than 40 I hit the 'too many open files' problem I mentioned before. Several people have suggested upping the 'ulimit' to 32768, but I've already got it set to 1024, so upping it to 32768 would theoretically allow me to create 30 x 40 indices, still an order of magnitude smaller than I need.

Next step I've tried is index merging. I've run the IndexMerger over some of my indices successfully, but when I replace the indexes directory (which contains the individual indices) with the new index, nutchwax stops working. It tells me that it's found some hits for my search term, but it doesn't list them, and wayback claims the index is unavailable. What else do I need to do to deploy a merged index?

Cheers,

Jack

Jackson Pope
Technical Lead
Web Archiving Team
The British Library
+44 (0)1937 54 6942
From: Brad T. <br...@ar...> - 2008-02-07 02:19:49
Wayback is an open-source Java implementation of the Internet Archive's Wayback Machine service. The 1.2.0 release includes support for compressed and uncompressed ARC and WARC files, support for duplicate-reduction WARC records, a new JavaScript-free ArchivalURL replay mode, and many bug fixes and other minor enhancements. For detailed features and changes, please see the Release Notes page on the Wayback project site at http://archive-access.sourceforge.net/projects/wayback/release_notes.html

Yours,
Internet Archive Webteam
From: Brad T. <br...@ar...> - 2008-02-06 00:06:34
Good question. This was a bug/missing feature in the software, but I've just tested a check-in to HEAD (the 1.2.0 release candidate) that addresses this issue. We're still not handling non-HTTP protocols correctly, but that will wait until 1.4.0, which will have a new index format allowing better searches, and should expose additional search options via the UI, allowing end users to relax canonicalization if they are not finding the documents they want.

So, as of now, the following tar.gz is the release candidate, and should fix this issue as well as numerous other bugs:

http://builds.archive.org:8080/maven2/org/archive/wayback/wayback/1.1.0-SNAPSHOT/wayback-1.1.0-20080204.230115-24-1.1.0-SNAPSHOT.tar.gz

Let me know if this works for you, and whether you find any other problems with this version.

Brad

Chris Vicary wrote:
> Hi all,
>
> I am having a problem retrieving harvested resources whose URLs include
> port numbers, using Wayback 1.0.1. We have a seed that includes a port
> number that was harvested using Heritrix. The resulting ARC files were
> indexed using Wayback, and the URLs stored in the index include the port
> number. Using the Wayback web address search interface, I am able to
> find the URLs by including the port number in the search string (if the
> port number is not included, no results are found - which is expected).
> The link for the search result does not include the port number,
> however, and clicking it does not retrieve the harvested resource. If
> the port number is inserted into the search result link, retrieval works
> fine. Even so, rewritten links on the retrieved page do not include a
> port number where applicable. So my question is, how do I ensure that
> port numbers are preserved in Wayback search results and in rewritten
> links?
>
> Thanks,
>
> Chris