From: Erik H. <eri...@uc...> - 2012-01-20 17:55:30
|
At Fri, 20 Jan 2012 17:11:45 +0100, raffaele messuti wrote:
>
> i got this error trying to enqueue some warcs into wayback (1.6)
>
> ➜ ./bin/cdx-indexer data/warcs/jlis-20012010.warc.gz
> java.io.IOException: Resetting to invalid mark
>     at java.io.BufferedInputStream.reset(BufferedInputStream.java:416)
>     at org.archive.io.GzippedInputStream.<init>(GzippedInputStream.java:123)
>     at org.archive.io.GzippedInputStream.<init>(GzippedInputStream.java:84)
>     at org.archive.io.warc.WARCReaderFactory$CompressedWARCReader.<init>(WARCReaderFactory.java:221)
>     at org.archive.io.warc.WARCReaderFactory.getArchiveReader(WARCReaderFactory.java:88)
>     at org.archive.io.ArchiveReaderFactory.getArchiveReader(ArchiveReaderFactory.java:110)
>     at org.archive.io.warc.WARCReaderFactory.get(WARCReaderFactory.java:63)
>     at org.archive.wayback.resourcestore.indexer.WarcIndexer.iterator(WarcIndexer.java:71)
>     at org.archive.wayback.resourcestore.indexer.IndexWorker.indexFile(IndexWorker.java:135)
>     at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:204)
>
> warcs are made with wget-warc,
> my current java version is "1.6.0_29"
>
> i tested with another java version (1.6.0_21) and it worked
>
> my guess is it's something related to https://webarchive.jira.com/browse/HER-1865

Hi Raffaele,

Yes, that is almost certainly the issue. The solution is to use the older JDK version (that is what we are doing at CDL) or to upgrade wayback to 1.6.1 (not yet released). See this message from Brad Tofel:

http://sourceforge.net/mailarchive/forum.php?thread_name=CCCA2F48128C1F4DAC8B38F9B49C50BE071D53%40OLADAGQP.lao.ola.org&forum_name=archive-access-discuss

best, Erik |
From: raffaele m. <raf...@at...> - 2012-01-20 16:27:06
|
i got this error trying to enqueue some warcs into wayback (1.6)

➜ ./bin/cdx-indexer data/warcs/jlis-20012010.warc.gz
java.io.IOException: Resetting to invalid mark
    at java.io.BufferedInputStream.reset(BufferedInputStream.java:416)
    at org.archive.io.GzippedInputStream.<init>(GzippedInputStream.java:123)
    at org.archive.io.GzippedInputStream.<init>(GzippedInputStream.java:84)
    at org.archive.io.warc.WARCReaderFactory$CompressedWARCReader.<init>(WARCReaderFactory.java:221)
    at org.archive.io.warc.WARCReaderFactory.getArchiveReader(WARCReaderFactory.java:88)
    at org.archive.io.ArchiveReaderFactory.getArchiveReader(ArchiveReaderFactory.java:110)
    at org.archive.io.warc.WARCReaderFactory.get(WARCReaderFactory.java:63)
    at org.archive.wayback.resourcestore.indexer.WarcIndexer.iterator(WarcIndexer.java:71)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.indexFile(IndexWorker.java:135)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:204)

warcs are made with wget-warc,
my current java version is "1.6.0_29"

i tested with another java version (1.6.0_21) and it worked

my guess is it's something related to https://webarchive.jira.com/browse/HER-1865

solutions?

ciao

--
raf...@at... |
From: Kris C. N. <kca...@ar...> - 2012-01-03 19:32:05
|
Web Archiving Doctoral Support Funding Available
Application Deadline February 15, 2012
http://infosciencephd.unt.edu/iipc-web-archiving-doctoral-support-award

The University of North Texas College of Information (http://www.ci.unt.edu) is accepting applications for a 3-year award to support doctoral studies in its Interdisciplinary Information Science Ph.D. Program (http://infosciencephd.unt.edu). The IIPC Web Archiving Doctoral Support Award is made possible by a grant from the International Internet Preservation Consortium (IIPC, http://netpreserve.org). The mission of the IIPC is to acquire, preserve and make accessible knowledge and information from the Internet for future generations everywhere, promoting global exchange and international relations.

The College of Information is collaborating with two members of the IIPC in providing a high-value, high-impact doctoral experience for the selected student. The overarching goal for this initiative is to build capacity in the academy to train and prepare future researchers and faculty members to address the multifaceted challenges of preserving and using web archives. Two IIPC members - University of North Texas Libraries (http://www.library.unt.edu) and the Internet Archive (http://www.archive.org) - will provide opportunities for hands-on practice and research to supplement and complement rigorous coursework. The awardee will also be directly engaged in activities of the IIPC.

This award will be made by April 1, 2012 to a qualified applicant to begin the Ph.D. program in Denton, Texas, in Fall 2012. The IIPC Web Archiving Doctoral Support Award includes the following funding, totaling approximately $40,000 per year in financial support:

* Annual scholarship to offset travel, lodging, and other living costs, as well as expenses related to coursework and study (e.g., books); provided by IIPC.
* Graduate Academic Tuition Scholarship (http://www.tsgs.unt.edu/questions#GATS) that will cover all tuition and mandatory fees; provided by the UNT Toulouse School of Graduate Studies and the College of Information.
* Graduate Research Assistantship providing a salary for 20 hours of work per week, plus health insurance, with assignment to UNT Libraries Digital Projects Unit; provided by the UNT Libraries and the College of Information.
* Summer internship at the Internet Archive (San Francisco, CA); paid by the Graduate Research Assistantship; provided by the Internet Archive and the College of Information.

Applicants must follow the special instructions to apply for the IIPC Web Archiving Doctoral Support Award. In addition, applicants must meet all general admission requirements of the UNT Toulouse School of Graduate Studies (http://www.tsgs.unt.edu) and follow the normal application process (http://infosciencephd.unt.edu/admission) for the Interdisciplinary Information Science Ph.D. Program.

If you are interested in this unique opportunity and for complete details and step-by-step instructions to apply for this award, see IIPC Web Archiving Doctoral Support Award (http://infosciencephd.unt.edu/iipc-web-archiving-doctoral-support-award). |
From: <Dom...@sw...> - 2012-01-03 12:57:17
|
Hi all,

I am looking for a way to edit archived web pages on the server side before the Wayback Machine displays them.

Some of my archived pages include flash video streams. These flash videos were downloaded, stored on a server and indexed in a MySQL database. Now I want to script something that looks up the flash video URL in the archived web page, searches MySQL for the link to the downloaded video, and inserts this link into the archived web page at the position of the flash video.

Is ArchivalUrlSaxReplay.xml the right way to do this? Which files do I have to edit?

Thank you for any hint or advice.

Here is a sample of the source code of an archived page that includes a flash video URL:

<div class="video512">
<script type="text/javascript">
var showplayer = true;
...
player.avaible_url['flashmedia']['2'] = "rtmp://flashmedia.mdn.newmedia.nacamar.net/2009/06-29/24831913.flv";
</script>
</div>

Best wishes,
Dominik |
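The lookup-and-substitute step described above can be sketched in a few lines of Python. This is only an illustration of the rewriting logic and says nothing about how it would be hooked into Wayback's replay configuration. The `local_videos` dictionary stands in for the MySQL table, and the function name, the regular expression and the example local URL are assumptions made for this sketch.

```python
import re

# Matches a double-quoted rtmp:// URL ending in .flv, as in the sample page.
RTMP_URL = re.compile(r'"(rtmp://[^"]+\.flv)"')


def replace_flash_urls(html, local_videos):
    """Replace each quoted rtmp URL that has a known local copy.

    local_videos is a hypothetical stand-in for the MySQL lookup: it maps
    the original stream URL to the URL of the downloaded file. URLs that
    have no entry are left untouched.
    """
    def substitute(match):
        original = match.group(1)
        return '"' + local_videos.get(original, original) + '"'

    return RTMP_URL.sub(substitute, html)


if __name__ == "__main__":
    stream = "rtmp://flashmedia.mdn.newmedia.nacamar.net/2009/06-29/24831913.flv"
    page = "player.avaible_url['flashmedia']['2'] = \"" + stream + "\";"
    # The local URL below is made up for the example.
    lookup = {stream: "http://videos.example.org/24831913.flv"}
    print(replace_flash_urls(page, lookup))
```

Leaving unknown URLs unchanged means a page still replays (with a dead stream) when a video was never downloaded, rather than failing outright.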
From: Erik H. <eri...@uc...> - 2011-12-12 19:43:50
|
At Mon, 12 Dec 2011 12:36:18 +0000, Matjaž Kragelj wrote:
>
> Hi everyone,
>
> I'll go straight to the problem:
>
> We have (on VMware) a server (Quad-core AMD Opteron, 2.4Ghz, (4 processors), 20GB RAM, Windows server 2008 R2, 64 bit)
> Apache Tomcat 7.0.23 and applications Web Curator Tool and 2 instances of Wayback machine (version 1.6 and 1.6.1)
> Apache Tomcat 6 - for Solr Lucene - full text search
>
> Tomcat 7 has 9GB RAM, Tomcat 6 has 5GB.
> I tried several initial and maximum memory pool settings in Tomcat (from 6GB max to 13GB max) but it is pretty much the same after an hour or so.
> Over the last few days Wayback reindexed data (89GB, 28.000 files in sub directories) and I had to stop all other applications for a few days, because of the process (creating the index for 27.000 files took almost 3 days)
>
> So, the problem we have is a very, very slow Tomcat. We practically get a timeout every time we use WCT. It takes a whole minute to get the welcome screen. Wayback is also very, very slow. Now that Tomcat (7 and 6) have 13.5 GB together, Windows RAM consumption is approx 15GB.
>
> Java opts is:
> -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9004 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Xms768m -Xmx13000m
>
> JVM version is 1.6.0_21-b07
> I just tried disabling the paging file, we'll see the difference..
>
> I have a jsp script which shows the memory consumption of Java - here is a static html file: http://www.nuk.uni-lj.si/java.htm
>
> My question is - is this normal? Am I doing something wrong or do I really need more than 20GB of RAM to run Wayback machine and WCT on the same machine.
>
> If anyone has an idea - please share it with me.
> If anyone needs more data to understand the situation - here I am..

Hi Matjaž,

This might not help you much. We use wayback with CDX files, with far less memory usage and quick response. I don’t know if this is possible with WCT, but you might want to give it a look. It is trickier to set up, but it works well.

best, Erik |
From: Matjaž K. <Mat...@nu...> - 2011-12-12 12:49:30
|
Hi everyone,

I'll go straight to the problem:

We have (on VMware) a server (Quad-core AMD Opteron, 2.4Ghz, (4 processors), 20GB RAM, Windows server 2008 R2, 64 bit)
Apache Tomcat 7.0.23 and applications Web Curator Tool and 2 instances of Wayback machine (version 1.6 and 1.6.1)
Apache Tomcat 6 - for Solr Lucene - full text search

Tomcat 7 has 9GB RAM, Tomcat 6 has 5GB.
I tried several initial and maximum memory pool settings in Tomcat (from 6GB max to 13GB max) but it is pretty much the same after an hour or so.
Over the last few days Wayback reindexed data (89GB, 28.000 files in sub directories) and I had to stop all other applications for a few days, because of the process (creating the index for 27.000 files took almost 3 days)

So, the problem we have is a very, very slow Tomcat. We practically get a timeout every time we use WCT. It takes a whole minute to get the welcome screen. Wayback is also very, very slow. Now that Tomcat (7 and 6) have 13.5 GB together, Windows RAM consumption is approx 15GB.

Java opts is:
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9004 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Xms768m -Xmx13000m

JVM version is 1.6.0_21-b07
I just tried disabling the paging file, we'll see the difference..

I have a jsp script which shows the memory consumption of Java - here is a static html file: http://www.nuk.uni-lj.si/java.htm

My question is - is this normal? Am I doing something wrong or do I really need more than 20GB of RAM to run Wayback machine and WCT on the same machine.

If anyone has an idea - please share it with me.
If anyone needs more data to understand the situation - here I am..

Best wishes,
Matjaž |
From: Bradley T. <br...@ar...> - 2011-12-09 01:14:39
|
Hi Armin,

One other possibility, assuming you're using the automatic indexing systems in Wayback (the BDBIndex), is to look in your wayback directory under ".../index-data/merged/", where Wayback keeps a copy of the same CDX files that the "cdx-indexer" tool will create. Column 1 is the "canonicalized" (normalized) URL, and column 3 is the original URL.

Brad

On 12/6/11 11:15 AM, Aaron Binns wrote:
> Armin Schleicher<Arm...@ui...> writes:
>
>> Thanks for your reply! I would like to get a list of the urls in my
>> local wayback deployment.
>
> The Wayback Machine install package comes with a command-line tool for
> generating a CDX file for an ARC or WARC file, e.g.
>
>   ${wayback-install}/bin/cdx-indexer
>
> You can run it on your (w)arc files, one at a time, like this
>
>   $ cdx-indexer foo.arc.gz foo.cdx
>
> which reads foo.arc.gz and puts the index into foo.cdx.
>
> By default, the first column of the resulting foo.cdx file is the URL of
> the record. There is one line in the CDX per record in the (w)arc.
>
> Hope that helps,
>
> Aaron |
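To turn those CDX files into a URL list, something like the following Python sketch might be a starting point. It relies only on what is stated above (column 1 is the canonicalized URL, column 3 the original URL). That fields are separated by whitespace, and that header lines begin with "CDX" once stripped, are assumptions about the file layout, and the sample record is made up.

```python
def read_url_columns(cdx_lines):
    """Yield (canonical_url, original_url) pairs from CDX lines.

    Assumes whitespace-separated fields with the canonicalized URL in
    column 1 and the original URL in column 3. Blank lines, header lines
    and lines with fewer than three fields are skipped.
    """
    for line in cdx_lines:
        stripped = line.strip()
        # Assumption: a header line reads "CDX ..." after stripping.
        if not stripped or stripped.startswith("CDX "):
            continue
        fields = stripped.split()
        if len(fields) < 3:
            continue
        yield fields[0], fields[2]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real index.
    sample = [
        " CDX N b a m s k r M V g",
        "example.org/ 20111206191536 http://www.example.org/ text/html 200 ABC - - 1234 foo.arc.gz",
    ]
    for canonical, original in read_url_columns(sample):
        print(canonical, original)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so large CDX files do not need to be loaded into memory.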
From: Aaron B. <aa...@ar...> - 2011-12-06 19:15:36
|
Armin Schleicher <Arm...@ui...> writes:

> Thanks for your reply! I would like to get a list of the urls in my
> local wayback deployment.

The Wayback Machine install package comes with a command-line tool for generating a CDX file for an ARC or WARC file, e.g.

  ${wayback-install}/bin/cdx-indexer

You can run it on your (w)arc files, one at a time, like this

  $ cdx-indexer foo.arc.gz foo.cdx

which reads foo.arc.gz and puts the index into foo.cdx.

By default, the first column of the resulting foo.cdx file is the URL of the record. There is one line in the CDX per record in the (w)arc.

Hope that helps,

Aaron |
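Since the tool is run on one file at a time, a small wrapper can prepare the invocations for a whole directory. The Python sketch below only uses the `cdx-indexer INPUT OUTPUT` form shown above. The function names, the list of recognized suffixes and the default indexer path are assumptions for illustration.

```python
import os
import subprocess


def build_index_commands(archive_paths, indexer="cdx-indexer"):
    """Return one [indexer, input, output] command per ARC/WARC file.

    The output name swaps the archive suffix for ".cdx", mirroring the
    "foo.arc.gz foo.cdx" example. Files with other suffixes are ignored.
    """
    suffixes = (".warc.gz", ".arc.gz", ".warc", ".arc")
    commands = []
    for path in archive_paths:
        for suffix in suffixes:
            if path.endswith(suffix):
                commands.append([indexer, path, path[: -len(suffix)] + ".cdx"])
                break
    return commands


def index_directory(directory, indexer="cdx-indexer"):
    """Run the indexer on every archive file found directly in directory."""
    paths = sorted(os.path.join(directory, name) for name in os.listdir(directory))
    for command in build_index_commands(paths, indexer):
        # check=True stops at the first file the indexer cannot read.
        subprocess.run(command, check=True)
```

Calling `index_directory("/path/to/arcs", "/path/to/wayback/bin/cdx-indexer")` (both paths hypothetical) would then write one `.cdx` file next to each archive.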
From: Aaron B. <aa...@ar...> - 2011-12-06 18:44:10
|
Hello,

I'm not sure exactly what you mean by "access the Wayback database". Do you mean:

1. The list of URLs on waybackmachine.org
2. The list of URLs in your local wayback deployment
3. The list of URLs in your Archive-It collections

Since you are downloading the (w)arc files from Archive-It and putting them into your own Wayback installation, what information are you missing?

Aaron |
From: Armin S. <Arm...@ui...> - 2011-12-06 08:57:24
|
Hello List,

i am wondering if it is possible to access the Wayback database. I am currently designing an archive website and at the moment we are harvesting using Archive-It. I have a cronjob running that fetches the new arc files and imports them into my local Wayback install. What i would like to do is check which URLs are accessible via Wayback so that i can put them into my database.

My second option, since i create a Lucene index using NutchWAX, would be to get all the URL fields from there, but that seems like a workaround to me...

Thanks for your help,
Bests
Armin |
From: Allen S. <all...@gm...> - 2011-11-23 01:50:03
|
Dear Finn Bradley,

Good news: I can now replay part of my web content, but I still get the FAILED CANONICALIZE warnings. I don't think anything is wrong with the datadirs. Are there any updates, or is there anything else I should check for the FAILED CANONICALIZE warnings? Please advise.

Thank you.

Regards,
Allen

On Tue, Nov 22, 2011 at 9:40 AM, Allen Sim <all...@gm...> wrote:
> Hi Finn, Bradley L,
> 1. I have replaced my entire datadirs, but the result is still the same: when I
> look into the catalina log file, FAILED CANONICALIZE still appears.
> 2. The reason for files2 is that I configured it that way so that
> wayback will look for *any* ARC/WARC files under
> /tmp/arcstores, automatically.
>
> Please advise and help. I am very lost...
> Thanks in advance.
>
> Regards,
> Allen |
From: Allen S. <all...@gm...> - 2011-11-22 01:40:29
|
Hi Finn, Bradley L,

1. I have replaced my entire datadirs, but the result is still the same: when I look into the catalina log file, FAILED CANONICALIZE still appears.
2. The reason for files2 is that I configured it that way so that wayback will look for *any* ARC/WARC files under /tmp/arcstores, automatically.

Please advise and help. I am very lost...
Thanks in advance.

Regards,
Allen

On Tue, Nov 22, 2011 at 5:39 AM, Finn, Bradley L <bra...@ed...> wrote:
> I have told you before….
>
> You haven’t enabled recurse on your files1 bean and I don’t know why you
> have a files2 bean.
>
> Replace your entire datadirs with this:
>
> <bean id="datadirs" class="org.springframework.beans.factory.config.ListFactoryBean">
>   <property name="sourceList">
>     <list>
>       <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
>         <property name="name" value="arcs" />
>         <property name="prefix" value="/tmp/wayback/files1/" />
>         <property name="recurse" value="true" />
>       </bean>
>     </list>
>   </property>
> </bean>
>
> Then re-index. |
From: Finn, B. L <bra...@ed...> - 2011-11-21 21:39:36
|
I have told you before....

You haven't enabled recurse on your files1 bean and I don't know why you have a files2 bean.

Replace your entire datadirs with this:

<bean id="datadirs" class="org.springframework.beans.factory.config.ListFactoryBean">
  <property name="sourceList">
    <list>
      <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
        <property name="name" value="arcs" />
        <property name="prefix" value="/tmp/wayback/files1/" />
        <property name="recurse" value="true" />
      </bean>
    </list>
  </property>
</bean>

Then re-index.

From: Allen Sim [mailto:all...@gm...]
Sent: Monday, 21 November 2011 5:17 PM
To: arc...@li...; web...@li...
Subject: [Webcurator-users] Wayback reindex Failed Canonicalize problem

Hi all,
I tried to reindex and replay back my harvested websites.
I stopped my tomcat, deleted my temp/wayback file then recreate tmp/wayback/files1 and copied all the Arc files from tmp/arcstore into tmp/wayback/files1 and lastly restart my tomcat again.
But i cannot replay back my harvested websites. I checked at my catalina.log file and noticed that inside a lot of warning saying that "FAILED CANONICALIZE".
Following is my BDBCollection.xml:

<bean id="datadirs" class="org.springframework.beans.factory.config.ListFactoryBean">
  <property name="sourceList">
    <list>
      <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
        <property name="name" value="files1" />
        <property name="prefix" value="/tmp/wayback/files1/" />
      </bean>
      <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
        <property name="name" value="files2" />
        <property name="prefix" value="/tmp/arcstore/" />
        <property name="recurse" value="true" />
      </bean>

Anthing wrong???
I need your help and guidance.
Thanks in advance.

Regards,
Allen

-----------------------------------------------------------------------------
CONFIDENTIALITY NOTICE AND DISCLAIMER

Information in this transmission is intended only for the person(s) to whom it is addressed and may contain privileged and/or confidential information. If you are not the intended recipient, any disclosure, copying or dissemination of the information is unauthorised and you should delete/destroy all copies and notify the sender. No liability is accepted for any unauthorised use of the information contained in this transmission.

This disclaimer has been automatically added. |
From: Allen S. <all...@gm...> - 2011-11-21 06:16:43
|
Hi all,

I tried to reindex and replay my harvested websites.
I stopped my tomcat, deleted my temp/wayback file, then recreated tmp/wayback/files1, copied all the ARC files from tmp/arcstore into tmp/wayback/files1 and lastly restarted my tomcat.
But I cannot replay my harvested websites. I checked my catalina.log file and noticed a lot of warnings in it saying "FAILED CANONICALIZE".

Following is my BDBCollection.xml:

<bean id="datadirs" class="org.springframework.beans.factory.config.ListFactoryBean">
  <property name="sourceList">
    <list>
      <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
        <property name="name" value="files1" />
        <property name="prefix" value="/tmp/wayback/files1/" />
      </bean>
      <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
        <property name="name" value="files2" />
        <property name="prefix" value="/tmp/arcstore/" />
        <property name="recurse" value="true" />
      </bean>

Anything wrong???
I need your help and guidance.
Thanks in advance.

Regards,
Allen |
From: <al...@ai...> - 2011-11-14 18:49:19
|
Hello,

I was able to check out the image-search plugin. The Readme file states that one must use nutch-1.0-dev. However, I was unable to find this release in the Nutch repository or on the net. I tried to use nutch-1.0 instead. However, when I run ant tar in the imagsearch folder it gives errors like

error: cannot find symbol
[javac] extends org.apache.hadoop.mapred.OutputFormatBase<WritableComparable, LuceneDocumentWrapper> {
[javac] ^
[javac] symbol: class OutputFormatBase
[javac] location: package org.apache.hadoop.mapred

Could you please let me know how these errors could be fixed?

Thanks.
Alex. |
From: Erik H. <eri...@uc...> - 2011-11-01 20:08:16
|
At Tue, 1 Nov 2011 15:45:59 -0400, Mat Kelly wrote:
>
> rafaele,
> I followed the installation at:
> https://webarchive.jira.com/wiki/display/wayback/Wayback+Installation+and+Configuration+Guide
> , have a functional Wayback 1.6 instance but 'find' brings up nothing
> for cdx-indexer. Where is the "./bin" directory?

It’s in the tar.gz

best, Erik |
From: Mat K. <mk...@cs...> - 2011-11-01 19:46:07
|
Raffaele,

I followed the installation at:
https://webarchive.jira.com/wiki/display/wayback/Wayback+Installation+and+Configuration+Guide
, have a functional Wayback 1.6 instance but 'find' brings up nothing for cdx-indexer. Where is the "./bin" directory?

Thank you,
Mat

On Wed, Oct 26, 2011 at 9:52 AM, raffaele messuti <raf...@at...> wrote:
>
> On Oct 25, 2011, at 8:07 PM, Erik Hetzner wrote:
>> It seems warc-indexer has been removed from 1.6. Try the 1.4
>> distribution. It would be easy to create a warc-indexer script for
>> 1.6; the WarcIndexer class is still there, though it seems to be
>> lacking a main method.
>
> it's called cdx-indexer in 1.6, always in ./bin
>
> ciao
>
> --
> raf...@at... |
From: raffaele m. <raf...@at...> - 2011-10-27 14:08:01
|
On Oct 26, 2011, at 7:06 PM, Erik Hetzner wrote:
> Thanks for the tip! I was confused, as the source for cdx-indexer
> says:
>
> ## This script creates a CDX file for all ARC files in a directory
> ## PUTs those CDX files into a remote pipeline, and informs a remote
> ## LocationDB of the locations of all the ARC files.
>
> which is very different behavior. I have filed a ticket on the wayback
> jira.

hi Erik,

check also this message that Bradley posted here some months ago
http://sourceforge.net/mailarchive/message.php?msg_id=27948009
the cdx output of cdx-indexer has an extra field compared with that of warc-indexer

ciao

--
raf...@at... |
From: Erik H. <eri...@uc...> - 2011-10-26 17:06:36
|
At Wed, 26 Oct 2011 15:32:42 +0200, raffaele messuti wrote:
>
> On Oct 25, 2011, at 8:07 PM, Erik Hetzner wrote:
> > It seems warc-indexer has been removed from 1.6. Try the 1.4
> > distribution. It would be easy to create a warc-indexer script for
> > 1.6; the WarcIndexer class is still there, though it seems to be
> > lacking a main method.
>
> it's called cdx-indexer in 1.6, always in ./bin
>
> ciao

Hi Raffaele,

Thanks for the tip! I was confused, as the source for cdx-indexer says:

## This script creates a CDX file for all ARC files in a directory
## PUTs those CDX files into a remote pipeline, and informs a remote
## LocationDB of the locations of all the ARC files.

which is very different behavior. I have filed a ticket on the wayback jira.

best, Erik |
From: raffaele m. <raf...@at...> - 2011-10-26 13:52:02
|
On Oct 25, 2011, at 8:07 PM, Erik Hetzner wrote:
> It seems warc-indexer has been removed from 1.6. Try the 1.4
> distribution. It would be easy to create a warc-indexer script for
> 1.6; the WarcIndexer class is still there, though it seems to be
> lacking a main method.

it's called cdx-indexer in 1.6, always in ./bin

ciao

--
raf...@at... |
From: raffaele m. <raf...@at...> - 2011-10-26 13:47:04
|
i learned yesterday of the existence of wget-warc (great job by @archiveteam)
http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
https://github.com/alard/wget-warc/

the git trunk doesn't compile for me, but this works fine
https://github.com/downloads/alard/wget-warc/wget-warc-20111017.tar.bz2

has anyone here ever tested it?
i'm doing a small crawl right now, i'll test it in wayback soon

USAGE:

$ /opt/wget-warc/bin/wget --help | grep warc
  --warc-file=FILENAME       save request/response data to a .warc.gz file.
  --warc-header=STRING       insert STRING into the warcinfo record.
  --warc-max-size=NUMBER     set maximum size of WARC files to NUMBER.
  --warc-cdx                 write CDX index files.
  --warc-dedup=FILENAME      do not store records listed in this CDX file.
  --no-warc-compression      do not compress WARC files with GZIP.
  --no-warc-digests          do not calculate SHA1 digests.
  --no-warc-keep-log         do not store the log file in a WARC record.
  --warc-tempdir=DIRECTORY   location for temporary files created by the

greets

--
raf...@at... |
From: Erik H. <eri...@uc...> - 2011-10-25 18:07:44
|
At Tue, 25 Oct 2011 12:45:20 -0400, Mat Kelly wrote:
>
> Lauren,
> I added a Content-Length field representative of the content plus the
> header still to no avail. I am extremely interested in getting this
> working and hope to pursue Erik's suggestion of validating the WARC
> with the "warc-indexer binary distributed with wayback" but have a
> subtle problem in that I do not know where to find this binary to
> invoke. I have a working Wayback installation on Ubuntu Linux. Where
> would I find this warc-indexer binary?

Hi Mat,

It seems warc-indexer has been removed from 1.6. Try the 1.4 distribution. It would be easy to create a warc-indexer script for 1.6; the WarcIndexer class is still there, though it seems to be lacking a main method.

best,
Erik |
From: Mat K. <mk...@cs...> - 2011-10-25 16:45:28
|
Lauren,

I added a Content-Length field representative of the content plus the header still to no avail. I am extremely interested in getting this working and hope to pursue Erik's suggestion of validating the WARC with the "warc-indexer binary distributed with wayback" but have a subtle problem in that I do not know where to find this binary to invoke. I have a working Wayback installation on Ubuntu Linux. Where would I find this warc-indexer binary?

Thank you,
Mat

---------- Forwarded message ----------
From: Erik Hetzner <eri...@uc...>
Date: Thu, Oct 20, 2011 at 1:06 PM
Subject: Re: [Archive-access-discuss] WARC Manipulation and manually creating WARCs: Need guidance
To: "me...@ma..." <me...@ma...>

At Wed, 19 Oct 2011 22:46:37 +0000, Ko, Lauren wrote:
>
> Hi Mat,
>
> I don't think the warcvalid.py actually does a very thorough check
> of what is in your WARC at validation, so you shouldn't rely on
> this.
>
> […]

You can use the warc-indexer binary, distributed with wayback, to check to see if wayback can read your warc file, which would be a pretty good indication that you have a valid WARC.

best,
Erik |
From: Erik H. <eri...@uc...> - 2011-10-20 17:05:48
|
At Wed, 19 Oct 2011 22:46:37 +0000, Ko, Lauren wrote:
>
> Hi Mat,
>
> I don't think the warcvalid.py actually does a very thorough check
> of what is in your WARC at validation, so you shouldn't rely on
> this.
>
> […]

You can use the warc-indexer binary, distributed with wayback, to check to see if wayback can read your warc file, which would be a pretty good indication that you have a valid WARC.

best,
Erik |
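For anyone without the indexer to hand, the record layout itself can be sanity-checked. What follows is a minimal Python sketch, not part of wayback or the Hanzo tools. The function name check_first_record is made up for illustration, it handles only an uncompressed file, and it looks only at the first record. It relies on the WARC layout of a header, a blank line, Content-Length bytes of block, and then two CRLFs.

```python
def check_first_record(data: bytes) -> int:
    """Check the first record of an uncompressed WARC; return its Content-Length.

    Hypothetical helper for illustration only. Raises ValueError when the
    record does not follow the header / block / two-CRLF layout.
    """
    # The WARC header ends at the first blank line (CRLF CRLF).
    head, sep, rest = data.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line after the WARC header")

    lines = head.split(b"\r\n")
    if not lines[0].startswith(b"WARC/"):
        raise ValueError("first line is not a WARC version line")

    # Collect the named header fields, case-insensitively.
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        fields[name.strip().lower()] = value.strip()

    if b"content-length" not in fields:
        raise ValueError("WARC header has no Content-Length field")
    length = int(fields[b"content-length"])

    # The block must be exactly `length` bytes, followed by two CRLFs.
    if len(rest) < length:
        raise ValueError("file ends before Content-Length bytes were read")
    if rest[length:length + 4] != b"\r\n\r\n":
        raise ValueError("block is not followed by two CRLFs; "
                         "Content-Length is probably wrong")
    return length
```

Calling check_first_record(open(path, "rb").read()) on a small uncompressed file should raise ValueError when the declared length does not line up with the end of the block. Passing this check does not show that wayback can replay the record; running the indexer remains the better test.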
From: Ko, L. <Lau...@un...> - 2011-10-19 22:51:36
|
Hi Mat,

I don't think the warcvalid.py actually does a very thorough check of what is in your WARC at validation, so you shouldn't rely on this.

Looking at your WARC, I think one problem may be the Content-Length you are giving in your WARC record headers. Are you calculating that manually? According to the WARC specification, "A WARC record shall consist of a record header followed by a record content block and two new lines." The Content-Length should give the size of the entire block following the WARC headers. Looking at your response record that gives "Content-Length: 39" I see that this size only accounts for the payload of the record, that is "<html><body>Hello World!</body></html>". It should also include the HTTP headers that you added, as they are part of the block.

If you create your record using the Hanzo tools, if I remember correctly, it will calculate and add the Content-Length field automatically if you don't supply it. Whether it does or not though, you should look at that Content-Length field and what you are putting there.

Lauren Ko
Web Archiving Programmer
UNT Libraries
________________________________________
From: Mat Kelly [mk...@cs...]
Sent: Wednesday, October 19, 2011 2:33 PM
To: arc...@li...
Subject: Re: [Archive-access-discuss] WARC Manipulation and manually creating WARCs: Need guidance

Hello,

I have fabricated a WARC file with the help I have thus far obtained on this forum but am having difficulty getting Wayback to display the data contained within the record. I am able to add WARCs from other sources to my Wayback instance after adding my fabricated one and have them displayed, but the content within mine never displays. I have used the tools from Hanzo Archives (http://code.hanzoarchives.com/), particularly the warc-valid.py to assure that my WARC file has no trivial issues. warc-valid assures me that my WARC file is valid. How do I get this WARC file to be displayed in my Wayback instance? I have attached the WARC file.
Thank you,
Mat

On Wed, Oct 5, 2011 at 12:23 PM, Bradley Tofel <br...@ar...> wrote:
> HTTP headers are considered part of the response, and part of the archival
> record - if it's possible to save them within your system, I'd suggest
> grabbing them going forward, and also that you consider using Heritrix
> for your archiving. Once you have it running, you'll have standard formats
> available, tools that the rest of the community is using (a great resource
> for getting help), and a lot of features that will be cumbersome to
> replicate.
>
> You could fabricate the HTTP headers yourself for previously archived
> materials - Wayback will need them to replay content.
>
> As to the question about getting your new content indexed with Wayback,
> you'll need to either rename the file, so Wayback notices it as new content,
> or reset your indexing directory state:
>
> * stop Tomcat
> * delete all files under .../wayback/{index,index-data,file-db}
> * place new W/ARC files under .../wayback/files{1,2}
> * start Tomcat
>
> Hope this helps,
>
> Brad
>
> On 10/5/11 6:20 AM, Mat Kelly wrote:
>
> Brad,
> I did not realize Wayback would consider uncompressed WARCs. That
> information will be useful. I was also considering the ARC format to
> get around my WARC issues but have only recently begun to explore
> that.
>
> Regarding your questions, I do not currently collect HTTP headers for
> my data. I have created a tool that essentially saves a certain type
> of webpage and all associated media to a local directory and retains
> information such as time of archiving and original URI as metadata.
> Are HTTP headers critical for the format? Could they be artificially
> created to comply with the standard? I do know Java and was also
> looking into the three projects that Erik (thanks!) suggested to
> extract some of the code for my use or at least get a basis for
> porting the code but the WARC format seems pretty coupled with the
> rest of each package.
>
> From the truncating scheme I described in a past message, why should
> it not work if it is simply truncating off records? Should something
> else be adjusted in the resulting file to account for the difference
> in length and/or record count?
>
> Thanks,
> Mat
>
>
> ---------- Forwarded message ----------
> From: Bradley Tofel <br...@ar...>
> Date: Tue, Oct 4, 2011 at 9:28 PM
> Subject: Re: [Archive-access-discuss] WARC Manipulation and manually
> creating WARCs: Need guidance
> To: "me...@ma..." <me...@ma...>
>
>
> Hi Mat,
>
> Another solution to side-step the compression complexities while you
> work on the WARC format issues, would be using uncompressed WARC files
> - just skip the compress step altogether (be sure to remove the ".gz"
> suffix)
>
> Wayback should handle those fine - note you do still need to create
> WARC records to encapsulate the archived content, but this may lower
> the bar to some iterative testing.
>
> A couple questions to help steer you in the right direction:
>
> 1) do you have HTTP response headers for your archived content?
> 2) do you know Java?
>
> Brad
>
> On 10/4/11 5:09 PM, Erik Hetzner wrote:
>
> At Tue, 4 Oct 2011 20:02:01 -0400,
> Mat Kelly wrote:
>
> Erik,
> Thank you for the reply. Please do send your script, as it might be
> helpful. From the procedure above, I was hoping to create a base case
> WARC and if I am not doing so properly, is there a bare bones template
> to create a WARC file? Once I am familiar enough with the
> procedure/structure, I plan to write a script to do the work but
> wanted first to understand how I go about constructing a WARC. Please
> supply any insight you can, as I am just learning about this system.
>
> Hi Mat,
>
> Attached.
>
> As far as I know there is no template to create a WARC file.
>
> You might want to have a look at the warc-tools project [1] or the
> it.unimi tools [2] as well as the heritrix commons tools [3].
>
> best, Erik
>
> 1. http://code.hanzoarchives.com/
> 2.
> http://law.dsi.unimi.it/software/docs/it/unimi/dsi/law/warc/io/package-summary.html
> 3.
> http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix-commons/
>
> Sent from my free software system <http://fsf.org/>.
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
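The Content-Length rule Lauren describes in this thread can be sketched in Python. This is an illustration under stated assumptions, not code from the thread or from any WARC library: build_response_record is a hypothetical helper, the URI is a placeholder, the record carries only a minimal set of header fields, and the output is uncompressed. The point it shows is that the WARC Content-Length is taken over the whole block (HTTP status line and headers, blank line, payload), not over the payload alone.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_response_record(target_uri: str, http_head: bytes, payload: bytes) -> bytes:
    """Return one uncompressed WARC/1.0 response record as bytes.

    Hypothetical helper for illustration. `http_head` is the HTTP status
    line plus headers, CRLF-separated, without the trailing blank line.
    """
    # The block is everything after the WARC header: the fabricated HTTP
    # headers, a blank line, and then the payload.
    block = http_head + CRLF + CRLF + payload

    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {now}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        # Size of the whole block, HTTP headers included.
        f"Content-Length: {len(block)}",
    ]
    header = CRLF.join(line.encode("utf-8") for line in header_lines)

    # Header, blank line, block, then the two CRLFs that close the record.
    return header + CRLF + CRLF + block + CRLF + CRLF


payload = b"<html><body>Hello World!</body></html>"
http_head = CRLF.join([
    b"HTTP/1.1 200 OK",
    b"Content-Type: text/html",
    # The HTTP Content-Length counts the payload only.
    b"Content-Length: " + str(len(payload)).encode("ascii"),
])
record = build_response_record("http://example.org/", http_head, payload)
```

The record therefore contains two different Content-Length values: the HTTP one, which counts only the payload, and the WARC one, which is larger by the size of the HTTP headers and the blank line. Following Brad's suggestion, the bytes could be written to a file without the ".gz" suffix for iterative testing. Whether wayback also needs other records, such as a warcinfo record, is not settled by this sketch.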