archive-access-discuss Mailing List for Web Archive Access Utilities (Page 7)

Brought to you by: binzino, bradtofel, gojomo, ia_igor, and 5 others

archive-access-discuss — General discussion about archive-access projects

Re: [Archive-access-discuss] GzippedInputStream error

From: Erik H. <eri...@uc...> - 2012-01-20 17:55:30

At Fri, 20 Jan 2012 17:11:45 +0100,
raffaele messuti wrote:
> 
> i got this error trying to enqueue some warcs into wayback (1.6)
> 
> ➜   ./bin/cdx-indexer data/warcs/jlis-20012010.warc.gz 
> java.io.IOException: Resetting to invalid mark
> 	at java.io.BufferedInputStream.reset(BufferedInputStream.java:416)
> 	at org.archive.io.GzippedInputStream.<init>(GzippedInputStream.java:123)
> 	at org.archive.io.GzippedInputStream.<init>(GzippedInputStream.java:84)
> 	at org.archive.io.warc.WARCReaderFactory$CompressedWARCReader.<init>(WARCReaderFactory.java:221)
> 	at org.archive.io.warc.WARCReaderFactory.getArchiveReader(WARCReaderFactory.java:88)
> 	at org.archive.io.ArchiveReaderFactory.getArchiveReader(ArchiveReaderFactory.java:110)
> 	at org.archive.io.warc.WARCReaderFactory.get(WARCReaderFactory.java:63)
> 	at org.archive.wayback.resourcestore.indexer.WarcIndexer.iterator(WarcIndexer.java:71)
> 	at org.archive.wayback.resourcestore.indexer.IndexWorker.indexFile(IndexWorker.java:135)
> 	at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:204)
> 
> 
> warcs are made with wget-warc,
> my current java version is "1.6.0_29"
> 
> i tested with another java version (1.6.0_21) and it worked
> 
> my guess is that it is related to https://webarchive.jira.com/browse/HER-1865

Hi Raffaele,

Yes, that is almost certainly the issue. The solution is to use the
old JDK version (that is what we are doing at CDL) or upgrade wayback
to 1.6.1 (not yet released). See this message from Brad Tofel:

  http://sourceforge.net/mailarchive/forum.php?thread_name=CCCA2F48128C1F4DAC8B38F9B49C50BE071D53%40OLADAGQP.lao.ola.org&forum_name=archive-access-discuss

best, Erik
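
Before switching JDKs, it may be worth ruling out a damaged file. The following is a minimal sketch, in Python rather than Java, that walks the concatenated gzip members of a .warc.gz (compressed WARCs are conventionally written one gzip member per record). The helper name count_gzip_members is hypothetical and not part of Wayback, and the sketch does not reproduce or fix the behaviour tracked in HER-1865; it only checks that every member decompresses cleanly.

```python
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the concatenated gzip members in a byte string.

    Raises zlib.error on corrupt data and ValueError on a truncated member.
    """
    members = 0
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(31)
        decompressor.decompress(data)
        if not decompressor.eof:
            raise ValueError("truncated gzip member")
        members += 1
        # Bytes left over after the trailer belong to the next member.
        data = decompressor.unused_data
    return members
```

This reads the whole file into memory, so it is only suitable for a quick check on moderately sized files. If the count comes back without an exception, the file is probably readable as gzip and the JDK/Wayback combination remains the more likely culprit.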

[Archive-access-discuss] GzippedInputStream error

From: raffaele m. <raf...@at...> - 2012-01-20 16:27:06

i got this error trying to enqueue some warcs into wayback (1.6)

➜   ./bin/cdx-indexer data/warcs/jlis-20012010.warc.gz 
java.io.IOException: Resetting to invalid mark
	at java.io.BufferedInputStream.reset(BufferedInputStream.java:416)
	at org.archive.io.GzippedInputStream.<init>(GzippedInputStream.java:123)
	at org.archive.io.GzippedInputStream.<init>(GzippedInputStream.java:84)
	at org.archive.io.warc.WARCReaderFactory$CompressedWARCReader.<init>(WARCReaderFactory.java:221)
	at org.archive.io.warc.WARCReaderFactory.getArchiveReader(WARCReaderFactory.java:88)
	at org.archive.io.ArchiveReaderFactory.getArchiveReader(ArchiveReaderFactory.java:110)
	at org.archive.io.warc.WARCReaderFactory.get(WARCReaderFactory.java:63)
	at org.archive.wayback.resourcestore.indexer.WarcIndexer.iterator(WarcIndexer.java:71)
	at org.archive.wayback.resourcestore.indexer.IndexWorker.indexFile(IndexWorker.java:135)
	at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:204)


warcs are made with wget-warc,
my current java version is "1.6.0_29"

i tested with another java version (1.6.0_21) and it worked

my guess is that it is related to https://webarchive.jira.com/browse/HER-1865

solutions?
ciao


--
raf...@at...

[Archive-access-discuss] Web Archiving Doctoral Support Funding Available

From: Kris C. N. <kca...@ar...> - 2012-01-03 19:32:05

Web Archiving Doctoral Support Funding Available

Application Deadline February 15, 2012

http://infosciencephd.unt.edu/iipc-web-archiving-doctoral-support-award

The University of North Texas College of Information (http://www.ci.unt.edu) is accepting applications for a 3-year award to support doctoral studies in its Interdisciplinary Information Science Ph.D. Program (http://infosciencephd.unt.edu). The IIPC Web Archiving Doctoral Support Award is made possible by a grant from the International Internet Preservation Consortium (IIPC, http://netpreserve.org). The mission of the IIPC is to acquire, preserve and make accessible knowledge and information from the Internet for future generations everywhere, promoting global exchange and international relations.

The College of Information is collaborating with two members of the IIPC in providing a high-value, high-impact doctoral experience for the selected student. The overarching goal for this initiative is to build capacity in the academy to train and prepare future researchers and faculty members to address the multifaceted challenges of preserving and using web archives. Two IIPC members - University of North Texas Libraries (http://www.library.unt.edu) and the Internet Archive (http://www.archive.org) - will provide opportunities for hands-on practice and research to supplement and complement rigorous coursework. The awardee will also be directly engaged in activities of the IIPC. This award will be made by April 1, 2012 to a qualified applicant to begin the Ph.D. program in Denton, Texas, in Fall 2012.

The IIPC Web Archiving Doctoral Support Award includes the following funding, totaling approximately $40,000 per year in financial support:

* Annual scholarship to offset travel, lodging, and other living costs, as well as expenses related to coursework and study (e.g., books); provided by IIPC.

* Graduate Academic Tuition Scholarship (http://www.tsgs.unt.edu/questions#GATS) that will cover all tuition and mandatory fees; provided by the UNT Toulouse School of Graduate Studies and the College of Information.

* Graduate Research Assistantship providing a salary for 20 hours of work per week, plus health insurance, with assignment to UNT Libraries Digital Projects Unit; provided by the UNT Libraries and the College of Information.

* Summer internship at the Internet Archive (San Francisco, CA); paid by the Graduate Research Assistantship; provided by the Internet Archive and the College of Information.

Applicants must follow the special instructions to apply for the IIPC Web Archiving Doctoral Support Award. In addition, applicants must meet all general admission requirements of the UNT Toulouse School of Graduate Studies (http://www.tsgs.unt.edu) and follow the normal application process (http://infosciencephd.unt.edu/admission) for the Interdisciplinary Information Science Ph.D. Program. If you are interested in this unique opportunity, see the IIPC Web Archiving Doctoral Support Award page (http://infosciencephd.unt.edu/iipc-web-archiving-doctoral-support-award) for complete details and step-by-step instructions to apply for this award.

[Archive-access-discuss] Edit archived pages server side

From: <Dom...@sw...> - 2012-01-03 12:57:17

Hi all,

I am looking for a way to edit archived web pages on the server side before
the Wayback Machine displays them.

Some of my archived pages include Flash video streams. These Flash videos
were downloaded, stored on a server, and indexed in a MySQL database. Now I
want to script something that looks up the Flash video URL in the archived
web page, searches MySQL for the link to the downloaded video, and inserts
this link into the archived web page at the position of the Flash video.

Is ArchivalUrlSaxReplay.xml the right way to do this? Which files do I
have to edit?

Thank you for any hint or advice.


Here is sample source code from an archived page that includes a Flash
video URL:

<div class="video512">
	<script type="text/javascript">
	var showplayer = true;
	...
	player.avaible_url['flashmedia']['2'] = "rtmp://flashmedia.mdn.newmedia.nacamar.net/2009/06-29/24831913.flv";
	</script>
</div>


Best wishes,

Dominik
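
The lookup-and-replace step described above could be sketched roughly as follows. This is a standalone Python illustration, not a Wayback extension point, and it says nothing about whether ArchivalUrlSaxReplay.xml is the right hook. The function name replace_stream_urls, the regular expression, and the use of a plain dictionary in place of the MySQL lookup are all assumptions made for the example.

```python
import re

# Matches rtmp:// URLs that end in .flv and contain no quotes or whitespace.
RTMP_URL = re.compile(r"rtmp://[^\"'\s]+\.flv")


def replace_stream_urls(html: str, local_copies: dict) -> str:
    """Swap each known rtmp:// URL for the link to its downloaded copy.

    local_copies maps the original stream URL to the replacement link;
    in a real setup this lookup would be a database query.
    """
    def substitute(match):
        url = match.group(0)
        # URLs without a downloaded copy are left untouched.
        return local_copies.get(url, url)

    return RTMP_URL.sub(substitute, html)
```

Because the replacement happens on the raw page text, it also reaches URLs embedded in inline JavaScript such as the sample above, which an HTML-attribute rewriter would likely not touch.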

Re: [Archive-access-discuss] very slow Tomcat 7

From: Erik H. <eri...@uc...> - 2011-12-12 19:43:50

Hi Matjaž,

This might not help you much.

We use wayback with CDX files with far less memory usage and quick
response. I don’t know if this is possible with WCT, but you might
want to give it a look. It is trickier to set up, but it works well.

best, Erik

[Archive-access-discuss] very slow Tomcat 7

From: Matjaž K. <Mat...@nu...> - 2011-12-12 12:49:30

Hi everyone,

I'll go straight to the problem:

We have (on VMware) a server (quad-core AMD Opteron, 2.4 GHz (4 processors), 20 GB RAM, Windows Server 2008 R2, 64-bit) running:
Apache Tomcat 7.0.23 with the applications Web Curator Tool and 2 instances of the Wayback Machine (versions 1.6 and 1.6.1)
Apache Tomcat 6 with Solr Lucene for full-text search

Tomcat 7 has 9 GB of RAM; Tomcat 6 has 5 GB.
I tried several initial and maximum memory pool settings in Tomcat (from 6 GB max to 13 GB max), but the result is pretty much the same after an hour or so.
Over the last few days Wayback reindexed the data (89 GB, 28.000 files in subdirectories), and I had to stop all other applications for a few days because of it (creating the index for 27.000 files took almost 3 days).

So, the problem we have is a very, very slow Tomcat. We practically get a timeout every time we use WCT. It takes a whole minute to get the welcome screen. Wayback is also very, very slow. Now, when Tomcat (7 and 6) has 13.5 GB altogether, Windows RAM consumption is approx. 15 GB.

Java opts are:
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9004 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Xms768m -Xmx13000m

The JVM version is 1.6.0_21-b07.
I have just tried disabling the paging file; we'll see whether it makes a difference.

I have a JSP script which shows what Java is consuming; here is a static HTML file: http://www.nuk.uni-lj.si/java.htm

My question is: is this normal? Am I doing something wrong, or do I really need more than 20 GB of RAM to run the Wayback Machine and WCT on the same machine?

If anyone has an idea, please share it with me.
If anyone needs more data to understand the situation, here I am.

Best wishes,
Matjaž

Re: [Archive-access-discuss] Accessing Wayback Index

From: Bradley T. <br...@ar...> - 2011-12-09 01:14:39

Hi Armin,

One other possibility, assuming you're using the automatic indexing 
systems in Wayback (the BDBIndex) is to look in your wayback directory 
under ".../index-data/merged/" where Wayback keeps a copy of the same 
CDX files that the "cdx-indexer" tool will create.

Column 1 is the "canonicalized" (normalized) URL, and column 3 is the 
original URL.

Brad
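
To turn those CDX files into a plain list of URLs, something like the following Python sketch may be enough. It assumes the layout described above (column 3 holds the original URL) and further assumes that fields are separated by single spaces and that any header line starts with "CDX"; neither of those two details is confirmed in this thread. The function name original_urls is hypothetical.

```python
def original_urls(cdx_lines):
    """Yield the third space-separated field of each CDX data line."""
    for line in cdx_lines:
        line = line.strip()
        # Skip blank lines and the format header (" CDX N b a ...").
        if not line or line.startswith("CDX"):
            continue
        fields = line.split(" ")
        # Lines with fewer than three fields carry no original URL.
        if len(fields) >= 3:
            yield fields[2]
```

Any iterable of lines works as input, so an open file object can be passed directly, and wrapping the result in set() removes URLs that were captured more than once.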


Re: [Archive-access-discuss] Accessing Wayback Index

From: Aaron B. <aa...@ar...> - 2011-12-06 19:15:36

Armin Schleicher <Arm...@ui...> writes:

> Thanks for your reply! I would like to get a list of the urls in my
> local wayback deployment.

The Wayback Machine install package comes with a command-line tool for
generating a CDX file for an ARC or WARC file, e.g.

 ${wayback-install}/bin/cdx-indexer 

You can run it on your (w)arc files, one at a time, like this

  $ cdx-indexer foo.arc.gz foo.cdx

which reads foo.arc.gz and puts the index into foo.cdx.

By default, the first column of the resulting foo.cdx file is the URL of
the record.  There is one line in the CDX per record in the (w)arc.

Hope that helps,

Aaron
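
For a whole directory of (w)arc files, the one-at-a-time invocation above can be scripted. The Python sketch below only builds the command lines; it uses nothing beyond the "cdx-indexer input output" form shown in this message. The helper name indexer_commands, the list of file suffixes, and the choice to write each .cdx next to its source file are assumptions for illustration.

```python
from pathlib import Path

# File name endings treated as ARC/WARC files (an assumption).
ARCHIVE_SUFFIXES = (".arc.gz", ".warc.gz", ".arc", ".warc")


def indexer_commands(archive_dir, indexer="cdx-indexer"):
    """Return one [indexer, input, output] command per (w)arc file."""
    commands = []
    for path in sorted(Path(archive_dir).iterdir()):
        name = path.name
        for suffix in ARCHIVE_SUFFIXES:
            if name.endswith(suffix):
                # foo.arc.gz -> foo.cdx, written next to the source file.
                cdx = path.with_name(name[: -len(suffix)] + ".cdx")
                commands.append([indexer, str(path), str(cdx)])
                break
    return commands
```

Each returned list can be handed to subprocess.run(command, check=True). Passing the full path to ${wayback-install}/bin/cdx-indexer as the indexer argument avoids relying on PATH.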

Re: [Archive-access-discuss] Accessing Wayback Index

From: Aaron B. <aa...@ar...> - 2011-12-06 18:44:10

Hello,

I'm not sure exactly what you mean by "access the Wayback database".

Do you mean:

 1 The list of URLs on waybackmachine.org
 2 The list of URLs in your local wayback deployment
 3 The list of URLs in your Archive-It collections

Since you are downloading the (w)arc files from Archive-It and putting
them into your own Wayback installation, what information are you
missing?


Aaron

[Archive-access-discuss] Accessing Wayback Index

From: Armin S. <Arm...@ui...> - 2011-12-06 08:57:24

Hello List,

I am wondering if it is possible to access the Wayback database. I am 
currently designing an archive website, and at the moment we are 
harvesting using Archive-It. I have a cron job running that fetches the 
new ARC files and imports them into my local Wayback install. What I 
would like to do is check which URLs are accessible via Wayback so I can 
put them into my database.
My second option, since I create a Lucene index using NutchWAX, would be 
to get all the URL fields from there, but that seems like a workaround to 
me...

Thanks for your help,

Bests Armin

Re: [Archive-access-discuss] [Webcurator-users] Wayback reindex Failed Canonicalize problem

From: Allen S. <all...@gm...> - 2011-11-23 01:50:03

Dear Finn Bradley,
Good news: I can now replay part of my web content, but I still get the
FAILED CANONICALIZE warnings. I don't think anything is wrong with the datadirs.
Are there any updates, or is there anything I need to check for the FAILED CANONICALIZE warnings?
Please advise.

Thank you.
Regards,
Allen


Re: [Archive-access-discuss] [Webcurator-users] Wayback reindex Failed Canonicalize problem

From: Allen S. <all...@gm...> - 2011-11-22 01:40:29

Hi Finn, Bradley L,
1. I have replaced my entire datadirs, but the result is still the same: when I
look at the catalina log file, FAILED CANONICALIZE still appears.
2. The reason for files2 is that I configured it that way so that
wayback will automatically look for *any* ARC/WARC files under
/tmp/arcstores.

Please advise and help. I am very lost...
Thanks in advance.

Regards,
Allen




Re: [Archive-access-discuss] [Webcurator-users] Wayback reindex Failed Canonicalize problem

From: Finn, B. L <bra...@ed...> - 2011-11-21 21:39:36

I have told you before....

 

You haven't enabled recurse on your files1 bean and I don't know why you
have a files2 bean.

Replace your entire datadirs with this:

  <bean id="datadirs" class="org.springframework.beans.factory.config.ListFactoryBean">
    <property name="sourceList">
      <list>
        <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
          <property name="name" value="arcs" />
          <property name="prefix" value="/tmp/wayback/files1/" />
          <property name="recurse" value="true" />
        </bean>
      </list>
    </property>
  </bean>

 

Then re-index.
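
To check whether the recurse setting matters for a given directory, it can help to compare what a flat scan and a recursive scan would pick up. The Python sketch below is only an illustration of that comparison. It is not how DirectoryResourceFileSource is implemented, and the function name find_archives and the list of file endings are assumptions.

```python
from pathlib import Path

# File name endings treated as ARC/WARC files (an assumption).
ARCHIVE_ENDINGS = (".arc", ".arc.gz", ".warc", ".warc.gz")


def find_archives(prefix, recurse=False):
    """List ARC/WARC files under prefix, optionally descending into subdirectories."""
    root = Path(prefix)
    # rglob walks the whole tree; glob looks at the top level only.
    candidates = root.rglob("*") if recurse else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.name.endswith(ARCHIVE_ENDINGS)
    )
```

If find_archives("/tmp/wayback/files1/", recurse=True) returns more files than the call without recurse, some ARC files sit in subdirectories and would presumably be missed by a bean that lacks the recurse property.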

 


[Archive-access-discuss] Wayback reindex Failed Canonicalize problem

From: Allen S. <all...@gm...> - 2011-11-21 06:16:43

Hi all,
I tried to reindex and replay my harvested websites.
I stopped my Tomcat, deleted my temp/wayback directory, recreated
tmp/wayback/files1, copied all the ARC files from tmp/arcstore into
tmp/wayback/files1, and finally restarted Tomcat.
But I cannot replay my harvested websites. I checked my
catalina.log file and noticed a lot of warnings saying
"FAILED CANONICALIZE".
The following is my BDBCollection.xml:
  <bean id="datadirs" class="org.springframework.beans.factory.config.ListFactoryBean">
    <property name="sourceList">
      <list>
        <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
          <property name="name" value="files1" />
          <property name="prefix" value="/tmp/wayback/files1/" />
        </bean>
        <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
          <property name="name" value="files2" />
          <property name="prefix" value="/tmp/arcstore/" />
          <property name="recurse" value="true" />
        </bean>

Is anything wrong?
I need your help and guidance.
Thanks in advance.

Regards,
Allen

[Archive-access-discuss] image-search plugin

From: <al...@ai...> - 2011-11-14 18:49:19

Hello,

I was able to check out the image-search plugin. The Readme file states that one must use nutch-1.0-dev. However, I was unable to find this release in the Nutch repository or elsewhere on the net, so I tried nutch-1.0 instead.
However, when I run ant tar in the imagsearch folder, it gives errors like

 error: cannot find symbol
    [javac]             extends org.apache.hadoop.mapred.OutputFormatBase<WritableComparable, LuceneDocumentWrapper> {
    [javac]                                             ^
    [javac]   symbol:   class OutputFormatBase
    [javac]   location: package org.apache.hadoop.mapred


Could you please let me know how these errors could be fixed?

Thanks.
Alex.