From: Nicolas G. <nik...@gm...> - 2012-11-06 12:32:47
Hi,

I am currently unable to connect to Internet Archive's Maven repo, http://builds.archive.org:8080/maven2, which is bad since I need to build the latest Wayback sources. Does anyone have the same issue?

Cheers,

--
Nicolas Giraud
Développeur Archives du Web - Bibliothèque Nationale de France
Web Archiving Developer - National Library of France
From: Pranay P. <pra...@gm...> - 2012-11-02 19:13:30
Hi Bjarne, Lauren,

Thanks for the replies. It works fine with simply a basic connector definition.

Thanks,
Pranay

On Wed, Oct 31, 2012 at 4:52 PM, Bjarne Andersen <bj...@st...> wrote:
> I'm not near a computer, so I can't give you a complete configuration, but you "just" have to add a Connector similar to the one existing in a default Tomcat and change the two port numbers, e.g.:
>
> <Connector port="8090" protocol="HTTP/1.1"
>            connectionTimeout="20000"
>            redirectPort="8453" />
>
> This will allow Tomcat to listen on both the default 8080 and the new 8090. With this, the Wayback can serve both default playback and proxy replay at the same time.
>
> Best,
> Bjarne Andersen
> Netarchive.dk
>
> Sent from my iPhone
>
> On 31/10/2012 at 21.36, "Pranay Pramod" <pra...@gm...> wrote:
>> I am trying to configure Wayback to work in proxy mode. Going through the documentation at http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html I realize that I need to define a connector in Tomcat's server.xml. Any idea which properties need to go in the connector definition? A working definition would help.
>>
>> Thanks,
>> Pranay
>> Crawl Engineer
>> Library of Congress (Contractor)

--
Best,
Pranay
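To consolidate the advice in this thread: a minimal sketch of the relevant part of Tomcat's conf/server.xml with a second connector added for proxy-mode replay. The port numbers (8080/8090) are just the examples used above, not required values:

    <Service name="Catalina">
      <!-- Default connector: normal archival-URL playback -->
      <Connector port="8080" protocol="HTTP/1.1"
                 connectionTimeout="20000"
                 redirectPort="8443" />

      <!-- Second connector: Wayback proxy-mode replay on its own port -->
      <Connector port="8090" protocol="HTTP/1.1"
                 connectionTimeout="20000"
                 redirectPort="8443" />

      <Engine name="Catalina" defaultHost="localhost">
        <Host name="localhost" appBase="webapps"
              unpackWARs="true" autoDeploy="true" />
      </Engine>
    </Service>

The Wayback proxy access point is then bound to the second port in wayback.xml, so browsers pointed at host:8090 as an HTTP proxy get proxy replay while host:8080 continues to serve default playback.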
From: Bjarne A. <bj...@st...> - 2012-10-31 21:13:13
I'm not near a computer, so I can't give you a complete configuration, but you "just" have to add a Connector similar to the one existing in a default Tomcat and change the two port numbers, e.g.:

<Connector port="8090" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8453" />

This will allow Tomcat to listen on both the default 8080 and the new 8090. With this, the Wayback can serve both default playback and proxy replay at the same time.

Best,
Bjarne Andersen
Netarchive.dk

Sent from my iPhone

On 31/10/2012 at 21.36, "Pranay Pramod" <pra...@gm...> wrote:
> I am trying to configure Wayback to work in proxy mode. Going through the documentation at http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html I realize that I need to define a connector in Tomcat's server.xml. Any idea which properties need to go in the connector definition? A working definition would help.
>
> Thanks,
> Pranay
> Crawl Engineer
> Library of Congress (Contractor)
From: Ko, L. <Lau...@un...> - 2012-10-31 20:47:55
Hi Pranay,

Something like:

<Connector port="8090" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" />

Lauren Ko
Web Archiving Programmer
UNT Libraries

From: Pranay Pramod [pra...@gm...]
Sent: Wednesday, October 31, 2012 3:35 PM
To: arc...@li...
Subject: [Archive-access-discuss] tomcat connector for wayback's proxy mode replay

I am trying to configure Wayback to work in proxy mode. Going through the documentation at http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html I realize that I need to define a connector in Tomcat's server.xml. Any idea which properties need to go in the connector definition? A working definition would help.

Thanks,
Pranay
Crawl Engineer
Library of Congress (Contractor)
From: Pranay P. <pra...@gm...> - 2012-10-31 20:35:40
I am trying to configure Wayback to work in proxy mode. Going through the documentation at http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html I realize that I need to define a connector in Tomcat's server.xml. Any idea which properties need to go in the connector definition? A working definition would help.

Thanks,
Pranay
Crawl Engineer
Library of Congress (Contractor)
From: Drazenko C. <dra...@sr...> - 2012-10-29 10:36:41
Hi,

What date format should be used in Wayback advanced search (Exact Date, Earliest Date, Latest Date)? I tried yyyymmdd*, yyyymmdd, yyyymmddhhmmss, yyyy-mm-dd, yyyy/mm/dd, yy/mm/dd, and dd.mm.yyyy, and I always get all the results, as if I had left the date fields empty. Does it work at all? Should I configure something in wayback.xml? I am using Wayback version 1.7.1 build 25.

Thanks,
Drazenko Celjak
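For reference, Wayback's indexes and archival URLs use 14-digit GMT timestamps of the form yyyyMMddHHmmss. Whether the advanced-search form accepts partial dates depends on the version, so treat the following as an illustration of the timestamp format rather than a confirmed fix for the search form:

    # Archival-URL request against a local Wayback, full 14-digit timestamp:
    http://localhost:8080/wayback/20121029000000/http://www.example.com/

    # In archival URLs a prefix plus * typically lists matching captures:
    http://localhost:8080/wayback/2012*/http://www.example.com/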
From: Nicolas G. <nik...@gm...> - 2012-10-23 17:12:43
Hi,

I have deployed Wayback 1.6.0 in proxy mode. When I try to start a replay session on some websites ("http://www.bnf.fr" for instance) I get the following exception:

java.lang.NullPointerException
    java.lang.String.compareTo(String.java:1167)
    org.archive.wayback.resourceindex.filters.SelfRedirectFilter.filterObject(SelfRedirectFilter.java:63)
    org.archive.wayback.resourceindex.filters.SelfRedirectFilter.filterObject(SelfRedirectFilter.java:36)
    org.archive.wayback.util.ObjectFilterChain.filterObject(ObjectFilterChain.java:81)
    org.archive.wayback.util.ObjectFilterIterator.hasNext(ObjectFilterIterator.java:61)
    org.archive.wayback.resourceindex.LocalResourceIndex.doCaptureQuery(LocalResourceIndex.java:185)
    org.archive.wayback.resourceindex.LocalResourceIndex.query(LocalResourceIndex.java:275)
    org.archive.wayback.webapp.AccessPoint.handleReplay(AccessPoint.java:309)
    org.archive.wayback.webapp.AccessPoint.handleRequest(AccessPoint.java:213)
    org.archive.wayback.util.webapp.RequestMapper.handleRequest(RequestMapper.java:183)
    org.archive.wayback.util.webapp.RequestFilter.doFilter(RequestFilter.java:109)

Looking at the code, it expects to find a scheme at the beginning of the redirected URL, but if it comes from the Location header there is absolutely no guarantee of that. Is this a bug, or is there some kind of configuration I got wrong? I've included my wayback.xml file.

Best regards,

--
Nicolas Giraud
Développeur Archives du Web - Bibliothèque Nationale de France
Web Archiving Developer - National Library of France
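The crash is consistent with a null scheme: a relative Location header produces a capture record whose redirect URL has no scheme, and the subsequent String.compareTo receives null. A minimal illustrative guard, not the project's actual patch (method and constant names follow the pattern of Wayback's ObjectFilter interface), would skip the self-redirect comparison whenever no scheme can be extracted:

    // Illustrative sketch only -- not the real SelfRedirectFilter fix.
    public int filterObject(CaptureSearchResult r) {
        String redirect = r.getRedirectUrl();  // may come from a Location header
        if (redirect == null || !redirect.contains("://")) {
            // Relative or schemeless redirect target: we cannot safely
            // compare it to the capture URL, so let the record through.
            return ObjectFilter.FILTER_INCLUDE;
        }
        // ... original scheme-aware self-redirect comparison continues here ...
        return ObjectFilter.FILTER_INCLUDE;
    }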
From: Colin R. <cs...@st...> - 2012-09-03 08:50:18
Hi,

I have a basic question about Wayback replay that I haven't been able to find the answer to in the docs. When viewing pages that may have been harvested multiple times, how does Wayback select or reject which captures of the elements on a given page to show in a given rendering? The question arises because we have been looking at the logs for viewing a particular harvest from 2005, but some of the elements (actually, I think, only the favicon.ico) are being fetched from a much later harvest (2011). Is there any way of controlling the time window of objects fetched in a given rendering?

Regards,
Colin Rosenthal
IT Developer
State and University Library, Aarhus
From: Pranay P. <ssp...@ya...> - 2012-08-01 14:55:39
I am revisiting this question after a while. How is Wayback expected to display pages that are password protected? As Noah also pointed out, is it the case that Wayback replays only the 401 (the very first fetch attempted) and doesn't consider the second fetch, which is 200 OK? I have crawled a couple of password-protected and HTML-login-form-based sites, but the Wayback display seems to pick the 401 responses only.

Thanks,
Pranay

----- Forwarded Message -----
From: Pranay Pandey <ssp...@ya...>
To: "arc...@ya..." <arc...@ya...>
Sent: Wednesday, April 25, 2012 10:04 AM
Subject: Re: [archive-crawler] Re: H3.0 config settings to crawl password protected pages.

That makes sense. I do see the timestamp for the 200 being a little ahead of the 401. I too suspect Wayback could be the culprit in improper handling of 401s. I will inquire about it on the access listserv. Thanks Noah!

Pranay

From: Noah Levitt <nl...@ar...>
To: arc...@ya...
Cc: Pranay Pandey <ssp...@ya...>
Sent: Tuesday, April 24, 2012 9:33 PM
Subject: Re: [archive-crawler] Re: H3.0 config settings to crawl password protected pages.

Hello Pranay,

The way it works, Heritrix requests the URL, gets the 401, then requeues the URL, noting that next time it should try with the credential. Next time around, the URL is crawled again with auth. The first time through, on the 401, almost all the normal processing completes, short of logging to crawl.log. Full WARC records are written; normally this includes a request and a metadata record. So that's the only thing that's puzzling about your scenario: there should also be two request records and two metadata records, and the timestamps on the 401 and 200 should be different.

As far as Wayback goes, it's quite possible that it doesn't handle 401s well. (Maybe the ideal behavior would be to replicate what happens on the live web, that is, replay the 401, look for basic auth matching the request record of the 200, and only serve the content in that case.)

Noah

On 03/12/2012 07:58 AM, Pranay Pandey wrote:
> Group,
>
> I am using 3.1.0 and I find it very strange to see just one response code (200) in crawl.log and two response codes (200 and 401) in the form of two response records in the WARC written. I am trying to crawl a password-protected site with the config settings I had outlined in my older email below.
>
> My question: are we ever supposed to get two "response" records in the WARC for the same object, fetched at the same time? While the first response gives 401, the second one has 200 OK. And while I see 200 OK written in both the WARC and crawl.log, Wayback displays the authentication-failed page (401) for all the protected pages.
>
> These are the two responses I see in the WARC, in the order they were written:
>
> WARC/1.0
> WARC-Type: response
> WARC-Target-URI: http://www.xyz.com/index.html/
> WARC-Date: 2012-03-09T19:27:57Z
> WARC-Payload-Digest: sha1:XNMHLNSS4YF24HWXC52LDEETXXVHV47X
> WARC-IP-Address: 207.45.182.58
> WARC-Record-ID: <urn:uuid:513f8d8f-edcc-42f8-ac07-136d2dcfe4a4>
> Content-Type: application/http; msgtype=response
> Content-Length: 3403
>
> HTTP/1.1 401 Authorization Required
> Date: Fri, 09 Mar 2012 19:27:56 GMT
> Server: Apache
> WWW-Authenticate: Basic realm="Netpreserve"
> Accept-Ranges: bytes
> Connection: close
> Content-Type: text/html
>
> WARC/1.0
> WARC-Type: response
> WARC-Target-URI: http://www.xyz.com/index.html/
> WARC-Date: 2012-03-09T19:27:57Z
> WARC-Payload-Digest: sha1:FGAN7GIN5JNYCHGL6BCB3TDPKDCFUJER
> WARC-IP-Address: 207.45.182.58
> WARC-Record-ID: <urn:uuid:f7f4761d-5c3b-4460-8ace-fed4e3477cd3>
> Content-Type: application/http; msgtype=response
> Content-Length: 6052
>
> HTTP/1.1 200 OK
> Date: Fri, 09 Mar 2012 19:27:56 GMT
> Server: Apache
> X-Powered-By: PHP/5.2.17
> Expires: Thu, 19 Nov 1981 08:52:00 GMT
> Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
> Pragma: no-cache
> Set-Cookie: PHPSESSID=7289c28a8ac5ba1b87aead5f94bc5473; path=/
> Content-Length: 5687
> Connection: close
> Content-Type: text/html
>
> From: Patrick <la...@ya...>
> To: arc...@ya...
> Sent: Monday, February 27, 2012 11:04 AM
> Subject: [archive-crawler] Re: H3.0 config settings to crawl password protected pages.
>
> I have got the same problem. Have you worked out a solution?
>
> --- In arc...@ya..., Pranay Pandey <sspranay@...> wrote:
>>
>> OK, I took a shot at configuring the beans; it doesn't seem to be helping. The job gets built and run but doesn't pass the authentication stage. Any suggestions?
>>
>> This is what I tried:
>>
>> <bean id="HttpAuthenticationCredential" class="org.archive.modules.credential.HttpAuthenticationCredential">
>>   <property name="domain" value="www.mysite.com"/>
>>   <property name="login" value="xxxxx"/>
>>   <property name="password" value="xxxxxx"/>
>> </bean>
>>
>> <bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
>>   <property name="credentials">
>>     <map>
>>       <entry key="credentials" value-ref="HttpAuthenticationCredential" />
>>     </map>
>>   </property>
>> </bean>
>>
>> And inside fetchHTTP, I add the property "credentialStore" as below:
>>
>> <bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
>>   <property name="credentialStore">
>>     <ref bean="credentialStore"/>
>>   </property>
>>
>> Thanks,
>> Pranay
>>
>> --- On Thu, 6/16/11, Pranay Pandey <sspranay@...> wrote:
>>
>> Subject: [archive-crawler] H3.0 config settings to crawl password protected pages.
>> Date: Thursday, June 16, 2011, 11:16 AM
>>
>> Hello,
>>
>> I have been looking around for the bean configuration settings needed to be able to crawl password-protected sites/pages. Does anyone have it handy for H-3.0?
>>
>> Thanks!
>> Pranay
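To check what a crawl actually wrote, the response records and their HTTP status lines can be listed straight from a WARC with standard tools; a rough sketch, assuming GNU grep and a record layout like the one quoted above:

    # List record types, target URIs, and HTTP status lines from a WARC.
    # zcat handles the per-record gzip members of a .warc.gz.
    zcat crawl.warc.gz | grep -a -E '^(WARC-Type:|WARC-Target-URI:|HTTP/1\.[01] )'

If a 401 and a 200 response record both show up for the same WARC-Target-URI, the capture side is fine, and the question of which record gets replayed moves to Wayback's index.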
From: Ilja S. <ilj...@he...> - 2012-07-27 13:16:36
Hello,

I've been trying to use the Wayback machine (1.4.2; upgrading to 1.6 is not possible currently) over HTTPS, but so far the correct configuration keeps evading me.

1) I've set a DNS alias to point to my Wayback server (not sure if relevant).

2) Apache is forwarding requests to port 443 to Tomcat via AJP:

ProxyPass / ajp://localhost:8009/

3) I have an access point set up as follows:

<bean name="80" class="org.archive.wayback.webapp.AccessPoint">
  <property name="collection" ref="localcdxcollection" />
  <property name="replay" ref="archivalurlreplay" />
  <property name="query">
    <bean class="org.archive.wayback.query.Renderer">
      <property name="captureJsp" value="/WEB-INF/query/CalendarResults.jsp" />
    </bean>
  </property>
  <property name="uriConverter">
    <bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
      <property name="replayURIPrefix" value="https://my.fqdn/"/>
    </bean>
  </property>
  <property name="parser">
    <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser">
      <property name="maxRecords" value="1000" />
      <property name="earliestTimestamp" value="2006" />
    </bean>
  </property>
  <property name="exclusionFactory" ref="static-exclusion" />
  <property name="exactSchemeMatch" value="false" />
</bean>

I also have another, working access point (basically localhost:8080). If I use http (also in Apache), this access point works. Now I get null for my AccessPoint when I try to access it from my JSP pages. Any hints on what I am doing wrong would be greatly appreciated.

Ilja Sidoroff
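For the ProxyPass line above to reach Tomcat, server.xml needs an AJP connector listening on the forwarded port; the standard snippet (8009 is Tomcat's conventional AJP port, matching the ProxyPass above):

    <!-- AJP 1.3 connector: the Tomcat-side endpoint for Apache's mod_proxy_ajp -->
    <Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />

Note also that Wayback access point beans of this vintage are matched by the port the servlet container reports for the request; when Apache terminates SSL in front of AJP, the port Tomcat sees may not be the one the bean name ("80") assumes, so that mapping is worth checking too.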
From: Bert W. <bwe...@fr...> - 2012-07-19 14:20:30
Allen,

Linux systems are often configured to perform a 'tmpwatch' [1] on a regular basis by cron, to delete files in /tmp which haven't been used for a certain period of time, leaving the directories empty. So look at your system configuration to see if you find something like /etc/cron.daily/tmpwatch. If this is the case, you may:

- deactivate tmpwatch, or
- (better) configure it to not touch /tmp/wayback anymore, or
- (even better) move your wayback directory somewhere other than /tmp. It's never good to store files that you want to keep in /tmp.

Hope this helps,
Bert

[1] http://linux.die.net/man/8/tmpwatch

On Thu, 19 Jul 2012, 17:22, Allen Sim wrote:
> Hi all,
> I have encountered a problem: all my harvested websites are stored in /tmp/wayback and processed in /tmp/wayback/files1. It is stored in a format like the following:
> 819224/1/IAH-20110710042453-00000-kgpnssrrs060.arc, IAH-20110710042453-00000-kgpnssrrs060.cdx
> 819224/logs/crawl.log, progress-statistics.log, uri-errors.log
> 819224/reports/crawl-manifest.txt, host-report.txt, processor-report.txt, and so on.
> But from time to time I notice that all the content inside the folders goes blank, leaving empty folders:
> 819224/1/ - empty
> 819224/logs/ - empty
> 819224/reports/ - empty
> Luckily I have a backup. My questions:
> 1. Is it because my harvested content is stored in /tmp that it gets removed from time to time?
> 2. Is it because my hard-disk space is insufficient, causing all the content to go blank?
>
> Please advise; looking forward to hearing from you.
>
> Regards,
> Allen
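A rough sketch of Bert's second option, assuming a stock /etc/cron.daily/tmpwatch script (exact flags vary by distribution; --exclude is documented in tmpwatch(8)):

    # /etc/cron.daily/tmpwatch (illustrative excerpt)
    # Exclude the wayback data directory from the 10-day /tmp cleanup:
    /usr/sbin/tmpwatch --exclude /tmp/wayback 10d /tmp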
From: Allen S. <all...@gm...> - 2012-07-19 09:23:09
Hi all,

I have encountered a problem: all my harvested websites are stored in /tmp/wayback and processed in /tmp/wayback/files1. It is stored in a format like the following:

819224/1/IAH-20110710042453-00000-kgpnssrrs060.arc, IAH-20110710042453-00000-kgpnssrrs060.cdx
819224/logs/crawl.log, progress-statistics.log, uri-errors.log
819224/reports/crawl-manifest.txt, host-report.txt, processor-report.txt, and so on.

But from time to time I notice that all the content inside the folders goes blank, leaving empty folders:

819224/1/ - empty
819224/logs/ - empty
819224/reports/ - empty

Luckily I have a backup. My questions:

1. Is it because my harvested content is stored in the /tmp folder that it gets removed from time to time?
2. Is it because my hard-disk space is insufficient, causing all the content to go blank?

Please advise; looking forward to hearing from you.

Regards,
Allen
From: Nicholas T. <ta...@gm...> - 2012-07-14 18:53:26
And...upgrading to Tomcat 6.0.35 did the trick! I was able to get Wayback 1.6.1 to run in a non-ROOT context without any modifications made to the configuration. Thanks again for your help, Lauren! ~Nicholas |
From: Nicholas T. <ta...@gm...> - 2012-07-14 18:27:22
Hi Lauren, thanks for the suggestions. I had checked whether Wayback would run before I modified the configuration (it didn't make any difference), and the catalina.<datestamp>.log didn't provide any more descriptive information before the "Error filterStart" entry. I tried a few other strategies but was still unable to get it to work:

1) Set wayback.basedir=/home/sansforensics/wayback/indexes (removed the trailing slash, since the places where the variable is used elsewhere didn't expect there to be one). Didn't make any difference.

2) Installed Brad's updated Wayback 1.6.1 to a non-ROOT context and tried running it without modifying the configuration. Tomcat Manager still says the application failed to start, and the catalina.<datestamp>.log reports "Error filterStart".

3) Updated Wayback 1.6.1 with the configuration I laid out in my previous e-mail. Same errors.

4) Installed Wayback 1.6.1 to the ROOT context and tried to get it to run both with and without having modified the configuration. Same errors.

At this point, all I can think of to do is upgrade Tomcat 6 to a more recent version or just try running Wayback on Windows via Cygwin. I still feel like I must be missing something simple and fundamental, but I'm perplexed as to what it could be.

~Nicholas
From: Ko, L. <Lau...@un...> - 2012-07-10 23:25:22
Hi Nicholas,

Since I didn't see any responses to you on the list, I will respond even though I don't have a solution. I tried your wayback-1.6.0 configuration below and did not get that error, though my environment isn't exactly the same:

Ubuntu 10.04
jdk1.6.0_07
apache-tomcat-6.0.35

Did you check to see if Wayback would run before you made the configuration changes? Was there anything else of interest in catalina.out before the filterStart error?

I will say that though wayback-1.6.0 will launch for me, when running it in a non-ROOT context as you are, this version of Wayback hasn't worked well for me. Here are some messages about it: http://sourceforge.net/mailarchive/message.php?msg_id=27425763

Lauren Ko
Web Archiving Programmer
UNT Libraries

From: Nicholas Taylor [ta...@gm...]
Sent: Thursday, July 05, 2012 1:23 PM
To: arc...@li...
Subject: [Archive-access-discuss] error starting Wayback: "Error filterStart"

Hello archive-access-discussers,

I installed Wayback based on minor modifications to the instructions here:
https://webarchive.jira.com/wiki/display/wayback/Wayback+Installation+and+Configuration+Guide
but Tomcat reports "Error filterStart" when I try to start it. I'm hoping it's a simple configuration error that someone more experienced with Wayback and/or Tomcat could help me figure out.

Environment:
Ubuntu 9.10
Java 1.6.0_24-b07
Tomcat 6.0.20

Tomcat is running and I've been trying to start Wayback from the Tomcat Manager. Wayback is installed in the "wayback-1.6.0" context.

I made the following edits to wayback.xml:
wayback.basedir=/home/sansforensics/wayback/indexes/
wayback.urlprefix=http://localhost:8080/wayback-1.6.0/
Changed the bean named "8080:wayback" to "8080:test"
Changed the four instances of "${wayback.urlprefix}/" in that bean to "${wayback.urlprefix}test/"

I made the following changes to BDBcollection.xml, in the bean with id "datadirs":
<property name="name" value="warcfiles" />
<property name="prefix" value="/home/sansforensics/wayback/warcs/" />

Both dirs:
/home/sansforensics/wayback/indexes/
/home/sansforensics/wayback/warcs/
exist, are empty, and have 777 permissions.

Here is sample output from a catalina.<datestamp>.log:

Jul 1, 2012 6:28:17 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
Jul 1, 2012 6:28:17 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/wayback-1.6.0] startup failed due to previous errors
Jul 1, 2012 6:28:21 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
Jul 1, 2012 6:28:21 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/wayback-1.6.0] startup failed due to previous errors

Any ideas?
From: Nicholas T. <ta...@gm...> - 2012-07-05 18:24:38
Hello archive-access-discussers,

I installed Wayback based on minor modifications to the instructions here:
https://webarchive.jira.com/wiki/display/wayback/Wayback+Installation+and+Configuration+Guide
but Tomcat reports "Error filterStart" when I try to start it. I'm hoping it's a simple configuration error that someone more experienced with Wayback and/or Tomcat could help me figure out.

Environment:
Ubuntu 9.10
Java 1.6.0_24-b07
Tomcat 6.0.20

Tomcat is running and I've been trying to start Wayback from the Tomcat Manager. Wayback is installed in the "wayback-1.6.0" context.

I made the following edits to wayback.xml:
wayback.basedir=/home/sansforensics/wayback/indexes/
wayback.urlprefix=http://localhost:8080/wayback-1.6.0/
Changed the bean named "8080:wayback" to "8080:test"
Changed the four instances of "${wayback.urlprefix}/" in that bean to "${wayback.urlprefix}test/"

I made the following changes to BDBcollection.xml, in the bean with id "datadirs":
<property name="name" value="warcfiles" />
<property name="prefix" value="/home/sansforensics/wayback/warcs/" />

Both dirs:
/home/sansforensics/wayback/indexes/
/home/sansforensics/wayback/warcs/
exist, are empty, and have 777 permissions.

Here is sample output from a catalina.<datestamp>.log:

Jul 1, 2012 6:28:17 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
Jul 1, 2012 6:28:17 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/wayback-1.6.0] startup failed due to previous errors
Jul 1, 2012 6:28:21 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
Jul 1, 2012 6:28:21 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/wayback-1.6.0] startup failed due to previous errors

Any ideas?
From: Adam M. <ad...@ar...> - 2012-06-28 21:05:02
I've been working on an external browser processor to plug into the processor chain within Heritrix 3. It is at a pretty early stage of development, but is functional. The extractor processor is here:

https://github.com/adam-miller/ExternalBrowserExtractorHTML

I've worked with two different headless browsers, PhantomJS and ZombieJS. So far, Phantom has performed the best for me. My PhantomJS script is here:

https://github.com/adam-miller/phantomBrowserExtractor

It will not run Flash, but will run JavaScript and log all asynchronous requests to queue them in H3. So far, the main limitation with PhantomJS is that it is going to request all of the content in order to render the page. This causes duplicate requests, since Heritrix will be downloading the content on its own. I've been working on customizing PhantomJS to prevent these duplicate requests, but I don't have any code for that online yet.

~Adam Miller

>> From: Jon Walton <jon...@gm...>
>> Date: June 28, 2012 11:51:11 AM PDT
>> To: Erik Hetzner <eri...@uc...>
>> Cc: "arc...@li..." <arc...@li...>
>> Subject: Re: [Archive-access-discuss] Crawling Flash and Javascript
>>
>> I am guessing, but it seems to me that not all web objects are being stored during the Heritrix crawl, due to the fact that Heritrix (any version) does not execute JavaScript.
>>
>> Has anyone ever considered replacing the core Heritrix 3 web fetcher with something like HtmlUnit, which would execute JavaScript via Rhino? One way to implement this would be to create an optional web client, configured via Spring, which would execute JavaScript to better render a page at crawl time, resulting in the inclusion of these objects.
>>
>> As you mentioned, this is probably something that has come up on the crawler list.
>>
>> Jon
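The "log all asynchronous requests" behavior Adam describes maps onto PhantomJS's page.onResourceRequested hook; a minimal sketch of that pattern (the output handling is illustrative, not Adam's actual script -- see his repo above for the real one):

    // discover-resources.js -- minimal PhantomJS sketch: load a page with
    // JavaScript enabled and print every sub-resource URL it requests,
    // so a crawler can queue them. Run as: phantomjs discover-resources.js <url>
    var system = require('system');
    var page = require('webpage').create();

    page.onResourceRequested = function (requestData, networkRequest) {
        // Fires for every request the rendered page makes (XHR, images, scripts...).
        console.log(requestData.url);
    };

    page.open(system.args[1], function (status) {
        phantom.exit(status === 'success' ? 0 : 1);
    });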
From: Jon W. <jon...@gm...> - 2012-06-28 18:51:18
> Hi Anne,
>
> You might try the archive-crawler mailing list as well.
>
> All of us have encountered these issues. Capturing javascript & flash content is difficult. Replaying this content is even harder.
>
> Whether it is a Heritrix or a Wayback issue depends: it's probably both. If you can figure out what content needs to be captured in order for a site to work, you can then check your Heritrix crawl.log files to see if that content was captured. Heritrix is highly configurable and if you discover that Heritrix is not capturing the content you want, you may be able to change the configuration to make it capture what you want.
>
> After you have ensured that you are capturing the content, you can begin to evaluate whether Wayback is properly replaying the content. Whether Wayback can or is properly replaying the content depends on your Wayback configuration. For example, proxy mode can probably replay most content correctly, while I doubt that client-side rewriting will ever work very well.
>
> Finally, the only real way to test if this is fixed is to try out the new versions of Heritrix & Wayback and evaluate the results.

I am guessing, but it seems to me that not all web objects are being stored during the Heritrix crawl, due to the fact that Heritrix (any version) does not execute JavaScript.

Has anyone ever considered replacing the core Heritrix 3 web fetcher with something like HtmlUnit, which would execute JavaScript via Rhino? One way to implement this would be to create an optional web client, configured via Spring, which would execute JavaScript to better render a page at crawl time, resulting in the inclusion of these objects.

As you mentioned, this is probably something that has come up on the crawler list.

Jon
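For reference, the HtmlUnit approach Jon sketches looks roughly like this in recent HtmlUnit versions (a standalone sketch, not a Heritrix fetcher integration -- the processor-chain wiring would be the real work):

    // Minimal HtmlUnit sketch: fetch a page with JavaScript execution enabled,
    // then inspect the DOM as the browser would have rendered it.
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class JsAwareFetch {
        public static void main(String[] args) throws Exception {
            try (WebClient client = new WebClient()) {
                client.getOptions().setJavaScriptEnabled(true);
                client.getOptions().setThrowExceptionOnScriptError(false);
                HtmlPage page = client.getPage("http://www.example.com/");
                // asXml() reflects the post-JavaScript DOM, from which
                // additional URLs could be extracted and queued.
                System.out.println(page.asXml());
            }
        }
    }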
From: Erik H. <eri...@uc...> - 2012-06-28 16:55:56
At Wed, 27 Jun 2012 15:23:58 -0400, Leon, Anne wrote:
> Hi All,
>
> I have a question regarding crawling Flash and Javascript. Currently, I am utilizing Heritrix 1.14.4 and Wayback 1.4.2 and I have had issues capturing fully functioning websites. Websites that utilize javascript heavily have banners missing or empty widget boxes, and Flash content is virtually nonexistent. Within the next few months we will be upgrading to the newest versions of both programs, but I'm concerned that these problems will still exist.
>
> So, I'm wondering if any of you have encountered these issues and what have you done to remedy them? Is this a Heritrix issue or a Wayback issue? And lastly, did upgrading the software fix the problems? Thank you all in advance.

Hi Anne,

You might try the archive-crawler mailing list as well.

All of us have encountered these issues. Capturing javascript & flash content is difficult. Replaying this content is even harder.

Whether it is a Heritrix or a Wayback issue depends: it's probably both. If you can figure out what content needs to be captured in order for a site to work, you can then check your Heritrix crawl.log files to see if that content was captured. Heritrix is highly configurable and if you discover that Heritrix is not capturing the content you want, you may be able to change the configuration to make it capture what you want.

After you have ensured that you are capturing the content, you can begin to evaluate whether Wayback is properly replaying the content. Whether Wayback can or is properly replaying the content depends on your Wayback configuration. For example, proxy mode can probably replay most content correctly, while I doubt that client-side rewriting will ever work very well.

Finally, the only real way to test if this is fixed is to try out the new versions of Heritrix & Wayback and evaluate the results.

Hope that helps!

best, Erik
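A rough sketch of Erik's first check, assuming the standard Heritrix crawl.log layout (fields: timestamp, fetch status, size, URI, discovery path, referrer, ...); the file names here are illustrative:

    # Did the crawl fetch the missing widget's resource at all?
    grep -F 'banner.swf' logs/crawl.log

    # Show fetch status and URI for every .swf the crawl touched
    # (field 2 is the status code, field 4 the URI).
    awk '$4 ~ /\.swf/ { print $2, $4 }' logs/crawl.log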
From: Leon, A. <Le...@si...> - 2012-06-27 19:48:33
Hi All,

I have a question regarding crawling Flash and Javascript. Currently, I am utilizing Heritrix 1.14.4 and Wayback 1.4.2, and I have had issues capturing fully functioning websites. Websites that utilize javascript heavily have banners missing or empty widget boxes, and Flash content is virtually nonexistent. Within the next few months we will be upgrading to the newest versions of both programs, but I'm concerned that these problems will still exist.

So, I'm wondering if any of you have encountered these issues and what you have done to remedy them? Is this a Heritrix issue or a Wayback issue? And lastly, did upgrading the software fix the problems? Thank you all in advance.

Anne
From: Armin S. <Arm...@ui...> - 2012-06-19 13:55:19
Hello List,

I have a problem with my Wayback configuration, and it seems I just can't figure out what the problem is. Please excuse me if this is a noob question, as I am fairly new to this.

I imported a large collection of WARC files into my local Wayback instance. Everything is being indexed, and all the URLs can be found in the archive. However, it seems I can only replay the first capture of every one of these URLs. So if, for example, the URL www.test.com was captured on 27.09, 03.10, and 12.12, I can only replay the capture from 27.09.

Does anyone have an idea how this can be fixed? I'm very thankful for any hint you can give me. Thanks a lot in advance!
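One way to narrow this down is to ask the index directly for all captures of a URL before suspecting replay; in a standard archival-URL access point the queries look like this (host, port, and access-point name are illustrative):

    # Capture listing for one URL -- should show all three capture dates:
    http://localhost:8080/wayback/*/http://www.test.com/

    # Request one specific capture by its 14-digit timestamp:
    http://localhost:8080/wayback/20111003120000/http://www.test.com/

If the listing shows only one capture, the problem is on the indexing side (e.g., only one entry per URL survived indexing or a merge); if it shows all three but replay always serves the first, it is a replay-side issue.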
From: Bjarne A. <bj...@st...> - 2012-05-24 15:10:15
Thanks, Roger. I got a Java program from IA, but by default it required all your content to be stored on HDFS and then used Hadoop to extract content. I don't have that setup, so I gave the Hanzo warc-tools a shot. I tried their Python code last fall with little luck, but they have actually been working on the project, and it worked out of the box this time.

They have (among several tools):
- arc2warc.py, to convert to WARC
- warcfilter.py, to filter a WARC file by e.g. URL (regexp)

So using those two it is quite easy to extract material from one or more domains. A tricky situation is still embedded content from other domains that you want to include. The IA/Hadoop approach supported that by analysing crawl logs to find URIs of embedded things found at crawl time. But for this specific case the warc-tools were actually quite helpful.

Best,
Bjarne

Sent from my iPhone

On 24/05/2012 at 16.53, "Coram, Roger" <Rog...@bl...> wrote:
> Hi Bjarne,
>
> Only just saw your message. I'm not sure if you've had better responses so far, but here's a bash script I've used in the past:
>
> https://gist.github.com/2781979
>
> It should work via, for example: arc2warc -a INPUT_ARC.arc.gz -w OUTPUT_WARC.warc.gz -r "http://www\\.bl\\.uk"
>
> It does have one dependency, a Python script for stripping HTTP headers (in order to calculate the digest of the payload):
>
> https://gist.github.com/2781967
>
> However, you can probably remove that and include a WARC-Block-Digest, or remove it altogether.
>
> Roger G. Coram
> Web Archiving Engineer
> The British Library
> E: rog...@bl...
>
> -----Original Message-----
> From: Bjarne Andersen [mailto:bj...@st...]
> Sent: 11 May 2012 22:04
> To: arc...@li...
> Subject: [Archive-access-discuss] Extracting records from ARC files into new (W)ARC files
>
> Hi.
> A website owner is asking for an extract of material from a specific domain. Is anybody aware of a tool that, given either complete URLs or a URL regexp, would run through an ARC file and write all matching records into a new (W)ARC file?
>
> Best,
> Bjarne Andersen
>
> Sent from my iPhone
From: Noah L. <nl...@ar...> - 2012-05-18 21:18:26
Hello Erik,

https://github.com/internetarchive/wayback is the one.

Noah

On 2012-05-18 14:12, Erik Hetzner wrote:
> Hi all,
>
> Which is preferred?
>
> - https://github.com/internetarchive/wayback-machine
> - https://github.com/internetarchive/wayback
>
> Thank you!
>
> best, Erik
From: Erik H. <eri...@uc...> - 2012-05-18 21:12:56
Hi all,

Which is preferred?

- https://github.com/internetarchive/wayback-machine
- https://github.com/internetarchive/wayback

Thank you!

best, Erik
From: Erik H. <eri...@uc...> - 2012-05-16 21:21:38
At Wed, 16 May 2012 13:44:11 -0700, Aaron Binns wrote:
> Erik Hetzner <eri...@uc...> writes:
>
>> A quick question. UURI [1] is located in Heritrix Commons. HandyUrl is located in archive-commons. Which should I use?
>
> Hmmm, it might depend on your needs. AFAIK, UURI is geared towards Heritrix's needs, which includes a pretty light "normalization" of the URL. From an archival capture point of view, I think the idea is that Heritrix shouldn't munge the URL very much.
>
> However, HandyUrl is geared for access/playback/Wayback needs, and as such incorporates stronger URL normalization/canonicalization.
>
> I haven't spent much time in the code for either; the above is just my thoughts based on informal discussions with Gordon and Brad.

Thanks, Aaron. It sounds like HandyUrl is more appropriate for my current task.

best, Erik
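A small sketch of the capture-side flavor, using the Heritrix-commons factory (package names as of Heritrix 3-era code; the access-side canonicalization Aaron mentions lives in archive-commons' org.archive.url classes):

    // Light, crawl-oriented URL fix-up via Heritrix commons' UURI.
    import org.archive.net.UURI;
    import org.archive.net.UURIFactory;

    public class UuriDemo {
        public static void main(String[] args) throws Exception {
            // Performs light fix-up (scheme/host case, escaping, dot-segments)
            // while deliberately avoiding aggressive canonicalization.
            UURI uuri = UURIFactory.getInstance("HTTP://www.Example.com/a/../b");
            System.out.println(uuri.toString());
        }
    }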