From: Nicolas G. <nik...@gm...> - 2012-11-06 12:32:47
Hi,

I am currently unable to connect to Internet Archive's Maven repo, http://builds.archive.org:8080/maven2, which is bad since I need to build the latest Wayback sources. Does anyone have the same issue?

Cheers,

--
Nicolas Giraud
Développeur Archives du Web - Bibliothèque Nationale de France
Web Archiving Developer - National Library of France
From: Pranay P. <pra...@gm...> - 2012-11-02 19:13:30
Hi Bjarne, Lauren,

Thanks for the replies. It works fine with simply a basic connector definition.

Thanks,
Pranay

On Wed, Oct 31, 2012 at 4:52 PM, Bjarne Andersen <bj...@st...> wrote:
> I'm not near a computer, so I can't give you a complete configuration, but you "just" have to add a Connector similar to the one existing in a default Tomcat and change the two port numbers, e.g.:
>
> <Connector port="8090" protocol="HTTP/1.1"
>            connectionTimeout="20000"
>            redirectPort="8453" />
>
> This will allow Tomcat to listen on both the default 8080 and the new 8090. With this, the Wayback can serve both default playback and proxy replay at the same time.
>
> Best,
> Bjarne Andersen
> Netarchive.dk
>
> Sent from my iPhone
>
> On 31/10/2012 at 21.36, "Pranay Pramod" <pra...@gm...> wrote:
>> I am trying to configure Wayback to work in proxy mode. Going through the documentation at http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html I realize that I need to define a connector in Tomcat's server.xml. Any idea which properties need to go in the connector definition? A working definition would help.
>>
>> Thanks,
>> Pranay
>> Crawl Engineer
>> Library of Congress (Contractor)

--
Best,
Pranay
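To consolidate the advice in this thread: a minimal sketch of the relevant part of Tomcat's conf/server.xml with a second connector added for proxy-mode replay. The port numbers (8080/8090) are just the examples used above, not required values:

    <Service name="Catalina">
      <!-- Default connector: normal archival-URL playback -->
      <Connector port="8080" protocol="HTTP/1.1"
                 connectionTimeout="20000"
                 redirectPort="8443" />

      <!-- Second connector: Wayback proxy-mode replay on its own port -->
      <Connector port="8090" protocol="HTTP/1.1"
                 connectionTimeout="20000"
                 redirectPort="8443" />

      <Engine name="Catalina" defaultHost="localhost">
        <Host name="localhost" appBase="webapps"
              unpackWARs="true" autoDeploy="true" />
      </Engine>
    </Service>

The Wayback proxy access point is then bound to the second port in wayback.xml, so browsers pointed at host:8090 as an HTTP proxy get proxy replay while host:8080 continues to serve default playback.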
From: Bjarne A. <bj...@st...> - 2012-10-31 21:13:13
I'm not near a computer, so I can't give you a complete configuration, but you "just" have to add a Connector similar to the one existing in a default Tomcat and change the two port numbers, e.g.:

<Connector port="8090" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8453" />

This will allow Tomcat to listen on both the default 8080 and the new 8090. With this, the Wayback can serve both default playback and proxy replay at the same time.

Best,
Bjarne Andersen
Netarchive.dk

Sent from my iPhone

On 31/10/2012 at 21.36, "Pranay Pramod" <pra...@gm...> wrote:
> I am trying to configure Wayback to work in proxy mode. Going through the documentation at http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html I realize that I need to define a connector in Tomcat's server.xml. Any idea which properties need to go in the connector definition? A working definition would help.
>
> Thanks,
> Pranay
> Crawl Engineer
> Library of Congress (Contractor)
From: Ko, L. <Lau...@un...> - 2012-10-31 20:47:55
Hi Pranay,

Something like:

<Connector port="8090" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" />

Lauren Ko
Web Archiving Programmer
UNT Libraries

From: Pranay Pramod [pra...@gm...]
Sent: Wednesday, October 31, 2012 3:35 PM
To: arc...@li...
Subject: [Archive-access-discuss] tomcat connector for wayback's proxy mode replay

I am trying to configure Wayback to work in proxy mode. Going through the documentation at http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html I realize that I need to define a connector in Tomcat's server.xml. Any idea which properties need to go in the connector definition? A working definition would help.

Thanks,
Pranay
Crawl Engineer
Library of Congress (Contractor)
From: Pranay P. <pra...@gm...> - 2012-10-31 20:35:40
I am trying to configure Wayback to work in proxy mode. Going through the documentation at http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html I realize that I need to define a connector in Tomcat's server.xml. Any idea which properties need to go in the connector definition? A working definition would help.

Thanks,
Pranay
Crawl Engineer
Library of Congress (Contractor)
From: Drazenko C. <dra...@sr...> - 2012-10-29 10:36:41
Hi,

What date format should be used in Wayback advanced search (Exact Date, Earliest Date, Latest Date)? I tried yyyymmdd*, yyyymmdd, yyyymmddhhmmss, yyyy-mm-dd, yyyy/mm/dd, yy/mm/dd, and dd.mm.yyyy, and I always get all the results, as if I had left the date fields empty. Does it work at all? Should I configure something in wayback.xml? I am using Wayback version 1.7.1 build 25.

Thanks,
Drazenko Celjak
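For reference, Wayback's indexes and archival URLs use 14-digit GMT timestamps of the form yyyyMMddHHmmss. Whether the advanced-search form accepts partial dates depends on the version, so treat the following as an illustration of the timestamp format rather than a confirmed fix for the search form:

    # Archival-URL request against a local Wayback, full 14-digit timestamp:
    http://localhost:8080/wayback/20121029000000/http://www.example.com/

    # In archival URLs a prefix plus * typically lists matching captures:
    http://localhost:8080/wayback/2012*/http://www.example.com/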
From: Nicolas G. <nik...@gm...> - 2012-10-23 17:12:43
Hi,

I have deployed Wayback 1.6.0 in proxy mode. When I try to start a replay session on some websites ("http://www.bnf.fr" for instance) I get the following exception:

java.lang.NullPointerException
    java.lang.String.compareTo(String.java:1167)
    org.archive.wayback.resourceindex.filters.SelfRedirectFilter.filterObject(SelfRedirectFilter.java:63)
    org.archive.wayback.resourceindex.filters.SelfRedirectFilter.filterObject(SelfRedirectFilter.java:36)
    org.archive.wayback.util.ObjectFilterChain.filterObject(ObjectFilterChain.java:81)
    org.archive.wayback.util.ObjectFilterIterator.hasNext(ObjectFilterIterator.java:61)
    org.archive.wayback.resourceindex.LocalResourceIndex.doCaptureQuery(LocalResourceIndex.java:185)
    org.archive.wayback.resourceindex.LocalResourceIndex.query(LocalResourceIndex.java:275)
    org.archive.wayback.webapp.AccessPoint.handleReplay(AccessPoint.java:309)
    org.archive.wayback.webapp.AccessPoint.handleRequest(AccessPoint.java:213)
    org.archive.wayback.util.webapp.RequestMapper.handleRequest(RequestMapper.java:183)
    org.archive.wayback.util.webapp.RequestFilter.doFilter(RequestFilter.java:109)

Looking at the code, it expects to find a scheme at the beginning of the redirected URL, but if it comes from the Location header there is absolutely no guarantee of that. Is this a bug, or is there some kind of configuration I got wrong? I've included my wayback.xml file.

Best regards,

--
Nicolas Giraud
Développeur Archives du Web - Bibliothèque Nationale de France
Web Archiving Developer - National Library of France
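The crash is consistent with a null scheme: a relative Location header produces a capture record whose redirect URL has no scheme, and the subsequent String.compareTo receives null. A minimal illustrative guard, not the project's actual patch (method and constant names follow the pattern of Wayback's ObjectFilter interface), would skip the self-redirect comparison whenever no scheme can be extracted:

    // Illustrative sketch only -- not the real SelfRedirectFilter fix.
    public int filterObject(CaptureSearchResult r) {
        String redirect = r.getRedirectUrl();  // may come from a Location header
        if (redirect == null || !redirect.contains("://")) {
            // Relative or schemeless redirect target: we cannot safely
            // compare it to the capture URL, so let the record through.
            return ObjectFilter.FILTER_INCLUDE;
        }
        // ... original scheme-aware self-redirect comparison continues here ...
        return ObjectFilter.FILTER_INCLUDE;
    }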
From: Colin R. <cs...@st...> - 2012-09-03 08:50:18
Hi,

I have a basic question about Wayback replay that I haven't been able to find the answer to in the docs. When viewing pages that may have been harvested multiple times, how does Wayback select or reject which captures of the elements on a given page to show in a given rendering? The question arises because we have been looking at the logs for viewing a particular harvest from 2005, but some of the elements (actually, I think, only the favicon.ico) are being fetched from a much later harvest (2011). Is there any way of controlling the time window of objects fetched in a given rendering?

Regards,
Colin Rosenthal
IT Developer
State and University Library, Aarhus
From: Pranay P. <ssp...@ya...> - 2012-08-01 14:55:39
I am revisiting this question after a while. How is Wayback expected to display pages that are password protected? As Noah also pointed out, is it the case that Wayback replays only the 401 (the very first fetch attempted) and doesn't consider the second fetch, which is 200 OK? I have crawled a couple of password-protected and HTML-login-form-based sites, but the Wayback display seems to pick the 401 responses only.

Thanks,
Pranay

----- Forwarded Message -----
From: Pranay Pandey <ssp...@ya...>
To: "arc...@ya..." <arc...@ya...>
Sent: Wednesday, April 25, 2012 10:04 AM
Subject: Re: [archive-crawler] Re: H3.0 config settings to crawl password protected pages.

That makes sense. I do see the timestamp for the 200 being a little ahead of the 401. I too suspect Wayback could be the culprit in improper handling of 401s. I will inquire about it on the access listserv. Thanks Noah!

Pranay

From: Noah Levitt <nl...@ar...>
To: arc...@ya...
Cc: Pranay Pandey <ssp...@ya...>
Sent: Tuesday, April 24, 2012 9:33 PM
Subject: Re: [archive-crawler] Re: H3.0 config settings to crawl password protected pages.

Hello Pranay,

The way it works, Heritrix requests the URL, gets the 401, then requeues the URL, noting that next time it should try with the credential. Next time around, the URL is crawled again with auth. The first time through, on the 401, almost all the normal processing completes, short of logging to crawl.log. Full WARC records are written; normally this includes a request and a metadata record. So that's the only thing that's puzzling about your scenario: there should also be two request records and two metadata records, and the timestamps on the 401 and 200 should be different.

As far as Wayback goes, it's quite possible that it doesn't handle 401s well. (Maybe the ideal behavior would be to replicate what happens on the live web, that is, replay the 401, look for basic auth matching the request record of the 200, and only serve the content in that case.)

Noah

On 03/12/2012 07:58 AM, Pranay Pandey wrote:
> Group,
>
> I am using 3.1.0 and I find it very strange to see just one response code (200) in crawl.log and two response codes (200 and 401) in the form of two response records in the WARC written. I am trying to crawl a password-protected site with the config settings I had outlined in my older email below.
>
> My question: are we ever supposed to get two "response" records in the WARC for the same object, fetched at the same time? While the first response gives 401, the second one has 200 OK. And while I see 200 OK written in both the WARC and crawl.log, Wayback displays the authentication-failed page (401) for all the protected pages.
>
> These are the two responses I see in the WARC, in the order they were written:
>
> WARC/1.0
> WARC-Type: response
> WARC-Target-URI: http://www.xyz.com/index.html/
> WARC-Date: 2012-03-09T19:27:57Z
> WARC-Payload-Digest: sha1:XNMHLNSS4YF24HWXC52LDEETXXVHV47X
> WARC-IP-Address: 207.45.182.58
> WARC-Record-ID: <urn:uuid:513f8d8f-edcc-42f8-ac07-136d2dcfe4a4>
> Content-Type: application/http; msgtype=response
> Content-Length: 3403
>
> HTTP/1.1 401 Authorization Required
> Date: Fri, 09 Mar 2012 19:27:56 GMT
> Server: Apache
> WWW-Authenticate: Basic realm="Netpreserve"
> Accept-Ranges: bytes
> Connection: close
> Content-Type: text/html
>
> WARC/1.0
> WARC-Type: response
> WARC-Target-URI: http://www.xyz.com/index.html/
> WARC-Date: 2012-03-09T19:27:57Z
> WARC-Payload-Digest: sha1:FGAN7GIN5JNYCHGL6BCB3TDPKDCFUJER
> WARC-IP-Address: 207.45.182.58
> WARC-Record-ID: <urn:uuid:f7f4761d-5c3b-4460-8ace-fed4e3477cd3>
> Content-Type: application/http; msgtype=response
> Content-Length: 6052
>
> HTTP/1.1 200 OK
> Date: Fri, 09 Mar 2012 19:27:56 GMT
> Server: Apache
> X-Powered-By: PHP/5.2.17
> Expires: Thu, 19 Nov 1981 08:52:00 GMT
> Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
> Pragma: no-cache
> Set-Cookie: PHPSESSID=7289c28a8ac5ba1b87aead5f94bc5473; path=/
> Content-Length: 5687
> Connection: close
> Content-Type: text/html
>
> From: Patrick <la...@ya...>
> To: arc...@ya...
> Sent: Monday, February 27, 2012 11:04 AM
> Subject: [archive-crawler] Re: H3.0 config settings to crawl password protected pages.
>
> I have got the same problem. Have you worked out a solution?
>
> --- In arc...@ya..., Pranay Pandey <sspranay@...> wrote:
>>
>> OK, I took a shot at configuring the beans; it doesn't seem to be helping. The job gets built and run but doesn't pass the authentication stage. Any suggestions?
>>
>> This is what I tried:
>>
>> <bean id="HttpAuthenticationCredential" class="org.archive.modules.credential.HttpAuthenticationCredential">
>>   <property name="domain" value="www.mysite.com"/>
>>   <property name="login" value="xxxxx"/>
>>   <property name="password" value="xxxxxx"/>
>> </bean>
>>
>> <bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
>>   <property name="credentials">
>>     <map>
>>       <entry key="credentials" value-ref="HttpAuthenticationCredential" />
>>     </map>
>>   </property>
>> </bean>
>>
>> And inside fetchHTTP, I add the property "credentialStore" as below:
>>
>> <bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
>>   <property name="credentialStore">
>>     <ref bean="credentialStore"/>
>>   </property>
>>
>> Thanks,
>> Pranay
>>
>> --- On Thu, 6/16/11, Pranay Pandey <sspranay@...> wrote:
>>
>> Subject: [archive-crawler] H3.0 config settings to crawl password protected pages.
>> Date: Thursday, June 16, 2011, 11:16 AM
>>
>> Hello,
>>
>> I have been looking around for the bean configuration settings needed to be able to crawl password-protected sites/pages. Does anyone have it handy for H-3.0?
>>
>> Thanks!
>> Pranay
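To check what a crawl actually wrote, the response records and their HTTP status lines can be listed straight from a WARC with standard tools; a rough sketch, assuming GNU grep and a record layout like the one quoted above:

    # List record types, target URIs, and HTTP status lines from a WARC.
    # zcat handles the per-record gzip members of a .warc.gz.
    zcat crawl.warc.gz | grep -a -E '^(WARC-Type:|WARC-Target-URI:|HTTP/1\.[01] )'

If a 401 and a 200 response record both show up for the same WARC-Target-URI, the capture side is fine, and the question of which record gets replayed moves to Wayback's index.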
From: Ilja S. <ilj...@he...> - 2012-07-27 13:16:36
Hello,

I've been trying to use the Wayback machine (1.4.2; upgrading to 1.6 is not possible currently) over HTTPS, but so far the correct configuration keeps evading me.

1) I've set a DNS alias to point to my Wayback server (not sure if relevant).

2) Apache is forwarding requests to port 443 to Tomcat via AJP:

ProxyPass / ajp://localhost:8009/

3) I have an access point set up as follows:

<bean name="80" class="org.archive.wayback.webapp.AccessPoint">
  <property name="collection" ref="localcdxcollection" />
  <property name="replay" ref="archivalurlreplay" />
  <property name="query">
    <bean class="org.archive.wayback.query.Renderer">
      <property name="captureJsp" value="/WEB-INF/query/CalendarResults.jsp" />
    </bean>
  </property>
  <property name="uriConverter">
    <bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
      <property name="replayURIPrefix" value="https://my.fqdn/"/>
    </bean>
  </property>
  <property name="parser">
    <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser">
      <property name="maxRecords" value="1000" />
      <property name="earliestTimestamp" value="2006" />
    </bean>
  </property>
  <property name="exclusionFactory" ref="static-exclusion" />
  <property name="exactSchemeMatch" value="false" />
</bean>

I also have another, working access point (basically localhost:8080). If I use http (also in Apache), this access point works. Now I get null for my AccessPoint when I try to access it from my JSP pages. Any hints on what I am doing wrong would be greatly appreciated.

Ilja Sidoroff
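For the ProxyPass line above to reach Tomcat, server.xml needs an AJP connector listening on the forwarded port; the standard snippet (8009 is Tomcat's conventional AJP port, matching the ProxyPass above):

    <!-- AJP 1.3 connector: the Tomcat-side endpoint for Apache's mod_proxy_ajp -->
    <Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />

Note also that Wayback access point beans of this vintage are matched by the port the servlet container reports for the request; when Apache terminates SSL in front of AJP, the port Tomcat sees may not be the one the bean name ("80") assumes, so that mapping is worth checking too.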
From: Bert W. <bwe...@fr...> - 2012-07-19 14:20:30
Allen,

Linux systems are often configured to perform a 'tmpwatch' [1] on a regular basis by cron, to delete files in /tmp which haven't been used for a certain period of time, leaving the directories empty. So look at your system configuration to see if you find something like /etc/cron.daily/tmpwatch. If this is the case, you may:

- deactivate tmpwatch, or
- (better) configure it to not touch /tmp/wayback anymore, or
- (even better) move your wayback directory somewhere other than /tmp. It's never good to store files that you want to keep in /tmp.

Hope this helps,
Bert

[1] http://linux.die.net/man/8/tmpwatch

On Thu, 19 Jul 2012, 17:22, Allen Sim wrote:
> Hi all,
> I have encountered a problem: all my harvested websites are stored in /tmp/wayback and processed in /tmp/wayback/files1. It is stored in a format like the following:
> 819224/1/IAH-20110710042453-00000-kgpnssrrs060.arc, IAH-20110710042453-00000-kgpnssrrs060.cdx
> 819224/logs/crawl.log, progress-statistics.log, uri-errors.log
> 819224/reports/crawl-manifest.txt, host-report.txt, processor-report.txt, and so on.
> But from time to time I notice that all the content inside the folders goes blank, leaving empty folders:
> 819224/1/ - empty
> 819224/logs/ - empty
> 819224/reports/ - empty
> Luckily I have a backup. My questions:
> 1. Is it because my harvested content is stored in /tmp that it gets removed from time to time?
> 2. Is it because my hard-disk space is insufficient, causing all the content to go blank?
>
> Please advise; looking forward to hearing from you.
>
> Regards,
> Allen
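A rough sketch of Bert's second option, assuming a stock /etc/cron.daily/tmpwatch script (exact flags vary by distribution; --exclude is documented in tmpwatch(8)):

    # /etc/cron.daily/tmpwatch (illustrative excerpt)
    # Exclude the wayback data directory from the 10-day /tmp cleanup:
    /usr/sbin/tmpwatch --exclude /tmp/wayback 10d /tmp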
From: Allen S. <all...@gm...> - 2012-07-19 09:23:09
Hi all,

I have encountered a problem: all my harvested websites are stored in /tmp/wayback and processed in /tmp/wayback/files1. It is stored in a format like the following:

819224/1/IAH-20110710042453-00000-kgpnssrrs060.arc, IAH-20110710042453-00000-kgpnssrrs060.cdx
819224/logs/crawl.log, progress-statistics.log, uri-errors.log
819224/reports/crawl-manifest.txt, host-report.txt, processor-report.txt, and so on.

But from time to time I notice that all the content inside the folders goes blank, leaving empty folders:

819224/1/ - empty
819224/logs/ - empty
819224/reports/ - empty

Luckily I have a backup. My questions:

1. Is it because my harvested content is stored in the /tmp folder that it gets removed from time to time?
2. Is it because my hard-disk space is insufficient, causing all the content to go blank?

Please advise; looking forward to hearing from you.

Regards,
Allen
From: Nicholas T. <ta...@gm...> - 2012-07-14 18:53:26
And...upgrading to Tomcat 6.0.35 did the trick! I was able to get Wayback 1.6.1 to run in a non-ROOT context without any modifications made to the configuration. Thanks again for your help, Lauren! ~Nicholas |
From: Nicholas T. <ta...@gm...> - 2012-07-14 18:27:22
Hi Lauren, thanks for the suggestions. I had checked whether Wayback would run before I modified the configuration (it didn't make any difference), and the catalina.<datestamp>.log didn't provide any more descriptive information before the "Error filterStart" entry. I tried a few other strategies but was still unable to get it to work:

1) Set wayback.basedir=/home/sansforensics/wayback/indexes (removed the trailing slash, since the places where the variable is used elsewhere didn't expect there to be one). Didn't make any difference.

2) Installed Brad's updated Wayback 1.6.1 to a non-ROOT context and tried running it without modifying the configuration. Tomcat Manager still says the application failed to start, and the catalina.<datestamp>.log reports "Error filterStart".

3) Updated Wayback 1.6.1 with the configuration I laid out in my previous e-mail. Same errors.

4) Installed Wayback 1.6.1 to the ROOT context and tried to get it to run both with and without having modified the configuration. Same errors.

At this point, all I can think of to do is upgrade Tomcat 6 to a more recent version or just try running Wayback on Windows via Cygwin. I still feel like I must be missing something simple and fundamental, but I'm perplexed as to what it could be.

~Nicholas
From: Ko, L. <Lau...@un...> - 2012-07-10 23:25:22
Hi Nicholas,

Since I didn't see any responses to you on the list, I will respond even though I don't have a solution. I tried your wayback-1.6.0 configuration below and did not get that error, though my environment isn't exactly the same:

Ubuntu 10.04
jdk1.6.0_07
apache-tomcat-6.0.35

Did you check to see if Wayback would run before you made the configuration changes? Was there anything else of interest in catalina.out before the filterStart error?

I will say that though wayback-1.6.0 will launch for me, when running it in a non-ROOT context as you are, this version of Wayback hasn't worked well for me. Here are some messages about it: http://sourceforge.net/mailarchive/message.php?msg_id=27425763

Lauren Ko
Web Archiving Programmer
UNT Libraries

From: Nicholas Taylor [ta...@gm...]
Sent: Thursday, July 05, 2012 1:23 PM
To: arc...@li...
Subject: [Archive-access-discuss] error starting Wayback: "Error filterStart"

Hello archive-access-discussers,

I installed Wayback based on minor modifications to the instructions here:
https://webarchive.jira.com/wiki/display/wayback/Wayback+Installation+and+Configuration+Guide
but Tomcat reports "Error filterStart" when I try to start it. I'm hoping it's a simple configuration error that someone more experienced with Wayback and/or Tomcat could help me figure out.

Environment:
Ubuntu 9.10
Java 1.6.0_24-b07
Tomcat 6.0.20

Tomcat is running and I've been trying to start Wayback from the Tomcat Manager. Wayback is installed in the "wayback-1.6.0" context.

I made the following edits to wayback.xml:
wayback.basedir=/home/sansforensics/wayback/indexes/
wayback.urlprefix=http://localhost:8080/wayback-1.6.0/
Changed the bean named "8080:wayback" to "8080:test"
Changed the four instances of "${wayback.urlprefix}/" in that bean to "${wayback.urlprefix}test/"

I made the following changes to BDBcollection.xml, in the bean with id "datadirs":
<property name="name" value="warcfiles" />
<property name="prefix" value="/home/sansforensics/wayback/warcs/" />

Both dirs:
/home/sansforensics/wayback/indexes/
/home/sansforensics/wayback/warcs/
exist, are empty, and have 777 permissions.

Here is sample output from a catalina.<datestamp>.log:

Jul 1, 2012 6:28:17 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
Jul 1, 2012 6:28:17 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/wayback-1.6.0] startup failed due to previous errors
Jul 1, 2012 6:28:21 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
Jul 1, 2012 6:28:21 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/wayback-1.6.0] startup failed due to previous errors

Any ideas?
From: Nicholas T. <ta...@gm...> - 2012-07-05 18:24:38
Hello archive-access-discussers,

I installed Wayback based on minor modifications to the instructions here:
https://webarchive.jira.com/wiki/display/wayback/Wayback+Installation+and+Configuration+Guide
but Tomcat reports "Error filterStart" when I try to start it. I'm hoping it's a simple configuration error that someone more experienced with Wayback and/or Tomcat could help me figure out.

Environment:
Ubuntu 9.10
Java 1.6.0_24-b07
Tomcat 6.0.20

Tomcat is running and I've been trying to start Wayback from the Tomcat Manager. Wayback is installed in the "wayback-1.6.0" context.

I made the following edits to wayback.xml:
wayback.basedir=/home/sansforensics/wayback/indexes/
wayback.urlprefix=http://localhost:8080/wayback-1.6.0/
Changed the bean named "8080:wayback" to "8080:test"
Changed the four instances of "${wayback.urlprefix}/" in that bean to "${wayback.urlprefix}test/"

I made the following changes to BDBcollection.xml, in the bean with id "datadirs":
<property name="name" value="warcfiles" />
<property name="prefix" value="/home/sansforensics/wayback/warcs/" />

Both dirs:
/home/sansforensics/wayback/indexes/
/home/sansforensics/wayback/warcs/
exist, are empty, and have 777 permissions.

Here is sample output from a catalina.<datestamp>.log:

Jul 1, 2012 6:28:17 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
Jul 1, 2012 6:28:17 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/wayback-1.6.0] startup failed due to previous errors
Jul 1, 2012 6:28:21 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
Jul 1, 2012 6:28:21 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/wayback-1.6.0] startup failed due to previous errors

Any ideas?
From: Adam M. <ad...@ar...> - 2012-06-28 21:05:02
I've been working on an external browser processor to plug into the processor chain within Heritrix 3. It is at a pretty early stage of development, but is functional. The extractor processor is here:

https://github.com/adam-miller/ExternalBrowserExtractorHTML

I've worked with two different headless browsers, PhantomJS and ZombieJS. So far, Phantom has performed the best for me. My PhantomJS script is here:

https://github.com/adam-miller/phantomBrowserExtractor

It will not run Flash, but will run JavaScript and log all asynchronous requests to queue them in H3. So far, the main limitation with PhantomJS is that it is going to request all of the content in order to render the page. This causes duplicate requests, since Heritrix will be downloading the content on its own. I've been working on customizing PhantomJS to prevent these duplicate requests, but I don't have any code for that online yet.

~Adam Miller

>> From: Jon Walton <jon...@gm...>
>> Date: June 28, 2012 11:51:11 AM PDT
>> To: Erik Hetzner <eri...@uc...>
>> Cc: "arc...@li..." <arc...@li...>
>> Subject: Re: [Archive-access-discuss] Crawling Flash and Javascript
>>
>> I am guessing, but it seems to me that not all web objects are being stored during the Heritrix crawl, due to the fact that Heritrix (any version) does not execute JavaScript.
>>
>> Has anyone ever considered replacing the core Heritrix 3 web fetcher with something like HtmlUnit, which would execute JavaScript via Rhino? One way to implement this would be to create an optional web client, configured via Spring, which would execute JavaScript to better render a page at crawl time, resulting in the inclusion of these objects.
>>
>> As you mentioned, this is probably something that has come up on the crawler list.
>>
>> Jon
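The "log all asynchronous requests" behavior Adam describes maps onto PhantomJS's page.onResourceRequested hook; a minimal sketch of that pattern (the output handling is illustrative, not Adam's actual script -- see his repo above for the real one):

    // discover-resources.js -- minimal PhantomJS sketch: load a page with
    // JavaScript enabled and print every sub-resource URL it requests,
    // so a crawler can queue them. Run as: phantomjs discover-resources.js <url>
    var system = require('system');
    var page = require('webpage').create();

    page.onResourceRequested = function (requestData, networkRequest) {
        // Fires for every request the rendered page makes (XHR, images, scripts...).
        console.log(requestData.url);
    };

    page.open(system.args[1], function (status) {
        phantom.exit(status === 'success' ? 0 : 1);
    });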
From: Jon W. <jon...@gm...> - 2012-06-28 18:51:18
> Hi Anne,
>
> You might try the archive-crawler mailing list as well.
>
> All of us have encountered these issues. Capturing javascript & flash content is difficult. Replaying this content is even harder.
>
> Whether it is a Heritrix or a Wayback issue depends: it's probably both. If you can figure out what content needs to be captured in order for a site to work, you can then check your Heritrix crawl.log files to see if that content was captured. Heritrix is highly configurable and if you discover that Heritrix is not capturing the content you want, you may be able to change the configuration to make it capture what you want.
>
> After you have ensured that you are capturing the content, you can begin to evaluate whether Wayback is properly replaying the content. Whether Wayback can or is properly replaying the content depends on your Wayback configuration. For example, proxy mode can probably replay most content correctly, while I doubt that client-side rewriting will ever work very well.
>
> Finally, the only real way to test if this is fixed is to try out the new versions of Heritrix & Wayback and evaluate the results.

I am guessing, but it seems to me that not all web objects are being stored during the Heritrix crawl, due to the fact that Heritrix (any version) does not execute JavaScript.

Has anyone ever considered replacing the core Heritrix 3 web fetcher with something like HtmlUnit, which would execute JavaScript via Rhino? One way to implement this would be to create an optional web client, configured via Spring, which would execute JavaScript to better render a page at crawl time, resulting in the inclusion of these objects.

As you mentioned, this is probably something that has come up on the crawler list.

Jon
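For reference, the HtmlUnit approach Jon sketches looks roughly like this in recent HtmlUnit versions (a standalone sketch, not a Heritrix fetcher integration -- the processor-chain wiring would be the real work):

    // Minimal HtmlUnit sketch: fetch a page with JavaScript execution enabled,
    // then inspect the DOM as the browser would have rendered it.
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class JsAwareFetch {
        public static void main(String[] args) throws Exception {
            try (WebClient client = new WebClient()) {
                client.getOptions().setJavaScriptEnabled(true);
                client.getOptions().setThrowExceptionOnScriptError(false);
                HtmlPage page = client.getPage("http://www.example.com/");
                // asXml() reflects the post-JavaScript DOM, from which
                // additional URLs could be extracted and queued.
                System.out.println(page.asXml());
            }
        }
    }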
From: Erik H. <eri...@uc...> - 2012-06-28 16:55:56
At Wed, 27 Jun 2012 15:23:58 -0400, Leon, Anne wrote:
> Hi All,
>
> I have a question regarding crawling Flash and Javascript. Currently, I am utilizing Heritrix 1.14.4 and Wayback 1.4.2 and I have had issues capturing fully functioning websites. Websites that utilize javascript heavily have banners missing or empty widget boxes, and Flash content is virtually nonexistent. Within the next few months we will be upgrading to the newest versions of both programs, but I'm concerned that these problems will still exist.
>
> So, I'm wondering if any of you have encountered these issues and what have you done to remedy them? Is this a Heritrix issue or a Wayback issue? And lastly, did upgrading the software fix the problems? Thank you all in advance.

Hi Anne,

You might try the archive-crawler mailing list as well.

All of us have encountered these issues. Capturing javascript & flash content is difficult. Replaying this content is even harder.

Whether it is a Heritrix or a Wayback issue depends: it's probably both. If you can figure out what content needs to be captured in order for a site to work, you can then check your Heritrix crawl.log files to see if that content was captured. Heritrix is highly configurable and if you discover that Heritrix is not capturing the content you want, you may be able to change the configuration to make it capture what you want.

After you have ensured that you are capturing the content, you can begin to evaluate whether Wayback is properly replaying the content. Whether Wayback can or is properly replaying the content depends on your Wayback configuration. For example, proxy mode can probably replay most content correctly, while I doubt that client-side rewriting will ever work very well.

Finally, the only real way to test if this is fixed is to try out the new versions of Heritrix & Wayback and evaluate the results.

Hope that helps!

best, Erik
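A rough sketch of Erik's first check, assuming the standard Heritrix crawl.log layout (fields: timestamp, fetch status, size, URI, discovery path, referrer, ...); the file names here are illustrative:

    # Did the crawl fetch the missing widget's resource at all?
    grep -F 'banner.swf' logs/crawl.log

    # Show fetch status and URI for every .swf the crawl touched
    # (field 2 is the status code, field 4 the URI).
    awk '$4 ~ /\.swf/ { print $2, $4 }' logs/crawl.log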
From: Leon, A. <Le...@si...> - 2012-06-27 19:48:33
Hi All,

I have a question regarding crawling Flash and Javascript. Currently, I am utilizing Heritrix 1.14.4 and Wayback 1.4.2, and I have had issues capturing fully functioning websites. Websites that utilize javascript heavily have banners missing or empty widget boxes, and Flash content is virtually nonexistent. Within the next few months we will be upgrading to the newest versions of both programs, but I'm concerned that these problems will still exist.

So, I'm wondering if any of you have encountered these issues and what you have done to remedy them? Is this a Heritrix issue or a Wayback issue? And lastly, did upgrading the software fix the problems? Thank you all in advance.

Anne
From: Armin S. <Arm...@ui...> - 2012-06-19 13:55:19
Hello List,

I have a problem with my Wayback configuration, and it seems I just can't figure out what the problem is. Please excuse me if this is a noob question, as I am fairly new to this.

I imported a large collection of WARC files into my local Wayback instance. Everything is being indexed, and all the URLs can be found in the archive. However, it seems I can only replay the first capture of every one of these URLs. So if, for example, the URL www.test.com was captured on 27.09, 03.10, and 12.12, I can only replay the capture from 27.09.

Does anyone have an idea how this can be fixed? I'm very thankful for any hint you can give me. Thanks a lot in advance!
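One way to narrow this down is to ask the index directly for all captures of a URL before suspecting replay; in a standard archival-URL access point the queries look like this (host, port, and access-point name are illustrative):

    # Capture listing for one URL -- should show all three capture dates:
    http://localhost:8080/wayback/*/http://www.test.com/

    # Request one specific capture by its 14-digit timestamp:
    http://localhost:8080/wayback/20111003120000/http://www.test.com/

If the listing shows only one capture, the problem is on the indexing side (e.g., only one entry per URL survived indexing or a merge); if it shows all three but replay always serves the first, it is a replay-side issue.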
From: Bjarne A. <bj...@st...> - 2012-05-24 15:10:15
Thanks, Roger. I got a Java program from IA, but by default it required all your content to be stored on HDFS and then used Hadoop to extract content. I don't have that setup, so I gave the Hanzo warc-tools a shot. I tried their Python code last fall with little luck, but they have actually been working on the project, and it worked out of the box this time.

They have (among several tools):
- arc2warc.py, to convert to WARC
- warcfilter.py, to filter a WARC file by e.g. URL (regexp)

So using those two it is quite easy to extract material from one or more domains. A tricky situation is still embedded content from other domains that you want to include. The IA/Hadoop approach supported that by analysing crawl logs to find URIs of embedded things found at crawl time. But for this specific case the warc-tools were actually quite helpful.

Best,
Bjarne

Sent from my iPhone

On 24/05/2012 at 16.53, "Coram, Roger" <Rog...@bl...> wrote:
> Hi Bjarne,
>
> Only just saw your message. I'm not sure if you've had better responses so far, but here's a bash script I've used in the past:
>
> https://gist.github.com/2781979
>
> It should work via, for example: arc2warc -a INPUT_ARC.arc.gz -w OUTPUT_WARC.warc.gz -r "http://www\\.bl\\.uk"
>
> It does have one dependency, a Python script for stripping HTTP headers (in order to calculate the digest of the payload):
>
> https://gist.github.com/2781967
>
> However, you can probably remove that and include a WARC-Block-Digest, or remove it altogether.
>
> Roger G. Coram
> Web Archiving Engineer
> The British Library
> E: rog...@bl...
>
> -----Original Message-----
> From: Bjarne Andersen [mailto:bj...@st...]
> Sent: 11 May 2012 22:04
> To: arc...@li...
> Subject: [Archive-access-discuss] Extracting records from ARC files into new (W)ARC files
>
> Hi.
> A website owner is asking for an extract of material from a specific domain. Is anybody aware of a tool that, given either complete URLs or a URL regexp, would run through an ARC file and write all matching records into a new (W)ARC file?
>
> Best,
> Bjarne Andersen
>
> Sent from my iPhone
From: Noah L. <nl...@ar...> - 2012-05-18 21:18:26
Hello Erik,

https://github.com/internetarchive/wayback is the one.

Noah

On 2012-05-18 14:12, Erik Hetzner wrote:
> Hi all,
>
> Which is preferred?
>
> - https://github.com/internetarchive/wayback-machine
> - https://github.com/internetarchive/wayback
>
> Thank you!
>
> best, Erik
From: Erik H. <eri...@uc...> - 2012-05-18 21:12:56
Hi all,

Which is preferred?

- https://github.com/internetarchive/wayback-machine
- https://github.com/internetarchive/wayback

Thank you!

best, Erik
From: Erik H. <eri...@uc...> - 2012-05-16 21:21:38
At Wed, 16 May 2012 13:44:11 -0700, Aaron Binns wrote:
> Erik Hetzner <eri...@uc...> writes:
>
>> A quick question. UURI [1] is located in Heritrix Commons. HandyUrl is located in archive-commons. Which should I use?
>
> Hmmm, it might depend on your needs. AFAIK, UURI is geared towards Heritrix's needs, which includes a pretty light "normalization" of the URL. From an archival capture point of view, I think the idea is that Heritrix shouldn't munge the URL very much.
>
> However, HandyUrl is geared for access/playback/Wayback needs, and as such incorporates stronger URL normalization/canonicalization.
>
> I haven't spent much time in the code for either; the above is just my thoughts based on informal discussions with Gordon and Brad.

Thanks, Aaron. It sounds like HandyUrl is more appropriate for my current task.

best, Erik
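A small sketch of the capture-side flavor, using the Heritrix-commons factory (package names as of Heritrix 3-era code; the access-side canonicalization Aaron mentions lives in archive-commons' org.archive.url classes):

    // Light, crawl-oriented URL fix-up via Heritrix commons' UURI.
    import org.archive.net.UURI;
    import org.archive.net.UURIFactory;

    public class UuriDemo {
        public static void main(String[] args) throws Exception {
            // Performs light fix-up (scheme/host case, escaping, dot-segments)
            // while deliberately avoiding aggressive canonicalization.
            UURI uuri = UURIFactory.getInstance("HTTP://www.Example.com/a/../b");
            System.out.println(uuri.toString());
        }
    }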