From: John H. L. <jl...@ar...> - 2008-02-05 20:29:34
Hi Miguel. To use distributed search, you need to plan ahead a bit and generate multiple indices. I don't know of a way to partition an existing large index into smaller chunks. For example, if you're indexing 100,000 ARCs and want to deploy on 10 machines, you should split your list of ARCs into 10 chunks of 10,000, invoke ImportArcs for each chunk, and invoke NutchwaxIndexer for each chunk. This will produce 10 segment/index pairs, each of which can be deployed on one of your 10 machines. For large jobs, I usually split the ARCs into groups of 1000; this produces segment/index pairs that are small enough to be manageable and flexible when it comes to deployment layout. Hope this helps. -J

On Feb 5, 2008, at 5:12 AM, Miguel Costa wrote:

> Hi to all,
>
> After reading the nutchwax + nutch documentation I can index ARC files
> and search them using nutchwax + the wayback machine. However, I would
> like to perform a distributed search, but I can't find any documentation
> on how to partition the index into n parts/segments for n machines. On
> the other hand, there is information explaining how to distribute search
> using the search-servers.txt file, but I need to partition the index
> first. Can anyone explain or give me a clue on how to partition an index
> for n machines?
>
> Regards,
>
> Miguel Costa

_______________________________________________
Archive-access-discuss mailing list
Arc...@li...
https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
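The chunk-and-index recipe above can be sketched in shell. The `split` usage is standard; the ImportArcs/NutchwaxIndexer steps are echoed as placeholders, since the exact NutchWax command-line syntax is an assumption that varies by release. Check your NutchWax documentation before running this for real.

```shell
# Stand-in list of ARC paths (a real list might have 100,000 entries)
for i in $(seq 1 100); do echo "arc-$i.arc.gz"; done > arclist.txt

# Partition into chunks of 25 ARCs each -> arclist-00, arclist-01, ...
split -d -l 25 arclist.txt arclist-

# One ImportArcs + NutchwaxIndexer run per chunk yields one segment/index
# pair per chunk. Echoed as a dry run; the invocations below are
# illustrative names from this thread, not verbatim commands.
for chunk in arclist-??; do
  echo "ImportArcs      $chunk -> outputs/$chunk"
  echo "NutchwaxIndexer outputs/$chunk"
done
```

Each resulting segment/index pair can then be assigned to one search machine and listed in search-servers.txt.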
From: Chris V. <cv...@gm...> - 2008-02-05 17:08:55
Just a little update. I was able to get this to work in Wayback 0.8.0 (which we are still using in a production app). The ResultURIConverter class was extended and the makeReplayUri(SearchResult) method was overridden to use the RESULT_URL_KEY (which contains port information) to form the replay URI, instead of the RESULT_URL. The JSReplayRenderer class was also extended so that the <base href...> tag and the JavaScript sResourceUrl variable also use the RESULT_URL_KEY value. After minimal testing, this seems to work without breaking any existing functionality. Does anyone know of a scenario where this will not work? Eventually, once we move to Wayback 1.0.1, similar changes will need to be made there. -Chris

On Feb 4, 2008 3:19 PM, Chris Vicary <cv...@gm...> wrote:

> Hi all,
>
> I am having a problem retrieving harvested resources whose URLs include
> port numbers using Wayback 1.0.1. We have a seed that includes a port
> number that was harvested using Heritrix. The resulting ARC files were
> indexed using Wayback, and the URLs stored in the index include the port
> number. Using the Wayback web address search interface, I am able to find
> the URLs by including the port number in the search string (if the port
> number is not included, no results are found, which is expected). The
> link for the search result does not include the port number, however,
> and clicking it does not retrieve the harvested resource. If the port
> number is inserted into the search result link, retrieval works fine.
> Even so, rewritten links on the retrieved page do not include a port
> number where applicable. So my question is, how do I ensure that port
> numbers are preserved in Wayback search results and in rewritten links?
>
> Thanks,
>
> Chris
From: Miguel C. <mig...@fc...> - 2008-02-05 13:13:05
Hi to all,

After reading the nutchwax + nutch documentation I can index ARC files and search them using nutchwax + the wayback machine. However, I would like to perform a distributed search, but I can't find any documentation on how to partition the index into n parts/segments for n machines. On the other hand, there is information explaining how to distribute search using the search-servers.txt file, but I need to partition the index first. Can anyone explain or give me a clue on how to partition an index for n machines?

Regards,

Miguel Costa
From: Chris V. <cv...@gm...> - 2008-02-04 20:19:10
Hi all,

I am having a problem retrieving harvested resources whose URLs include port numbers using Wayback 1.0.1. We have a seed that includes a port number that was harvested using Heritrix. The resulting ARC files were indexed using Wayback, and the URLs stored in the index include the port number. Using the Wayback web address search interface, I am able to find the URLs by including the port number in the search string (if the port number is not included, no results are found, which is expected). The link for the search result does not include the port number, however, and clicking it does not retrieve the harvested resource. If the port number is inserted into the search result link, retrieval works fine. Even so, rewritten links on the retrieved page do not include a port number where applicable. So my question is, how do I ensure that port numbers are preserved in Wayback search results and in rewritten links?

Thanks,

Chris
From: John H. L. <jl...@ar...> - 2008-02-04 18:31:55
Hi Jack. It sounds like you need to increase the number of open files that a single process can have on your system. The more indexes you're searching, the more file handles the searcher process needs. On Unix-like systems, "ulimit -n" will tell you the current setting, and "ulimit -n N" will set a new value for your current shell. If you're using Linux, you can usually set these values permanently in /etc/security/limits.conf. For our systems, we use an arbitrary but high ulimit -n 32768. Hope this helps. -J

On Feb 4, 2008, at 7:06 AM, Pope, Jackson wrote:

> Hiya all,
>
> I'm trying to run NutchWax for searching. It's all set up ok, doing
> incremental indexing, and works fine. I'm not merging the indexes,
> however (I've a separate directory for each under /indexes, with an
> index.done file in each). This works fine for small numbers of indexes,
> but with large numbers of indexes I get an error when browsing to the
> NutchWax search page (before entering any search criteria): Too many
> files open.
>
> Do I have to merge all the indexes together to get around this, or is
> there another solution?
>
> Cheers,
>
> Jack
>
> Jackson Pope
> Technical Lead
> Web Archiving Team
> The British Library
> +44 (0)1937 54 6942
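The ulimit advice above, as a quick shell check. The limits.conf lines are shown as comments using the standard pam_limits format; the "tomcat" user name and the 32768 value (the figure mentioned in the reply) are illustrative, so adjust them for your deployment.

```shell
# Show the current per-process open-file limit for this shell
ulimit -n

# Raise it for the current shell session (exceeding the hard limit
# typically requires root):
#   ulimit -n 32768
#
# To make it permanent on Linux, add lines like these to
# /etc/security/limits.conf ("tomcat" is a hypothetical service user):
#   tomcat  soft  nofile  32768
#   tomcat  hard  nofile  32768
```

Remember that the searcher process (e.g. the Tomcat JVM) must be restarted from a shell or service that actually inherits the raised limit.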
From: Pope, J. <Jac...@bl...> - 2008-02-04 15:06:57
Hiya all,

I'm trying to run NutchWax for searching. It's all set up ok, doing incremental indexing, and works fine. I'm not merging the indexes, however (I've a separate directory for each under /indexes, with an index.done file in each). This works fine for small numbers of indexes, but with large numbers of indexes I get an error when browsing to the NutchWax search page (before entering any search criteria): Too many files open.

Do I have to merge all the indexes together to get around this, or is there another solution?

Cheers,

Jack

Jackson Pope
Technical Lead
Web Archiving Team
The British Library
+44 (0)1937 54 6942

**************************************************************************

Experience the British Library online at www.bl.uk

The British Library's new interactive Annual Report and Accounts 2006/07: www.bl.uk/mylibrary

Help the British Library conserve the world's knowledge. Adopt a Book. www.bl.uk/adoptabook

The Library's St Pancras site is WiFi-enabled

*************************************************************************

The information contained in this e-mail is confidential and may be legally privileged. It is intended for the addressee(s) only. If you are not the intended recipient, please delete this e-mail and notify the pos...@bl...: The contents of this e-mail must not be disclosed or copied without the sender's consent.

The statements and opinions expressed in this message are those of the author and do not necessarily reflect those of the British Library. The British Library does not take any responsibility for the views of the author.

*************************************************************************
From: Miguel C. <mig...@fc...> - 2008-02-04 10:36:14
Thank you Brad. This fix solved the problem.

Best regards,

Miguel Costa

-----Original Message-----
From: Brad Tofel [mailto:br...@ar...]
Sent: Friday, 1 February 2008 20:16
To: Miguel Costa
Cc: arc...@li...
Subject: Re: [Archive-access-discuss] org.archive.io.NoGzipMagicException

Hey Miguel,

I think I just found the problem: I hadn't checked in a small but crucial change to the wayback-code pom.xml which increases the dependency on archive-commons from 2.0.0 to 2.0.1. I'm betting this makes all the difference. Please try updating to the latest HEAD and let me know if that works for you.

Brad

Miguel Costa wrote:
> Hello,
>
> I installed wayback 1.1.0-SNAPSHOT from svn. When I query the wayback
> with a URL I get a:
>
> org.archive.io.NoGzipMagicException
>   org.archive.io.GzipHeader.readHeader(GzipHeader.java:122)
>   org.archive.io.GzipHeader.<init>(GzipHeader.java:107)
>   org.archive.io.GzippedInputStream.readHeader(GzippedInputStream.java:335)
>   org.archive.io.GzippedInputStream.gzipMemberSeek(GzippedInputStream.java:370)
>   org.archive.io.arc.ARCReaderFactory$CompressedARCReader.get(ARCReaderFactory.java:383)
>   org.archive.io.arc.ARCReaderFactory$CompressedARCReader.get(ARCReaderFactory.java:326)
>   org.archive.wayback.resourcestore.LocalARCResourceStore.retrieveResource(LocalARCResourceStore.java:108)
>   org.archive.wayback.webapp.AccessPoint.handleReplay(AccessPoint.java:312)
>   org.archive.wayback.webapp.AccessPoint.handleRequest(AccessPoint.java:280)
>   org.archive.wayback.webapp.RequestFilter.handle(RequestFilter.java:106)
>   org.archive.wayback.webapp.RequestFilter.doFilter(RequestFilter.java:90)
>
> The wayback finds the file and then checks if it is OK. This check
> throws a NoGzipMagicException because it doesn't find a "magic" number.
> The code used is in commons-2.0.0-SNAPSHOT-sources.jar (from Heritrix)
> for both projects, nutchwax and wayback.
>
> I also installed nutchwax 0.11.0-SNAPSHOT from svn (both projects from
> trunk) and indexed the same ARC files. The query results are presented
> OK. Other files present the same symptoms. Does anyone have a clue about
> this problem? Does anyone use this version of wayback without problems?
>
> Thanks
> --
> Miguel Costa
From: Kaisa K. <kau...@cs...> - 2008-02-04 08:58:02
Hi,

are there any plans to add full-text indexing and search to wayback? I know nutchwax makes full-text indexes, but:

- the latest nutchwax release (0.10.0) dates from January 2007
- it's awkward to use when you want to keep duplicate versions of harvested pages
- it seems to be difficult to integrate with the wayback user interface

Best,

kk
From: Brad T. <br...@ar...> - 2008-02-01 20:13:43
Hey Miguel,

I think I just found the problem: I hadn't checked in a small but crucial change to the wayback-code pom.xml which increases the dependency on archive-commons from 2.0.0 to 2.0.1. I'm betting this makes all the difference. Please try updating to the latest HEAD and let me know if that works for you.

Brad

Miguel Costa wrote:
> Hello,
>
> I installed wayback 1.1.0-SNAPSHOT from svn. When I query the wayback
> with a URL I get a:
>
> org.archive.io.NoGzipMagicException
>   org.archive.io.GzipHeader.readHeader(GzipHeader.java:122)
>   org.archive.io.GzipHeader.<init>(GzipHeader.java:107)
>   org.archive.io.GzippedInputStream.readHeader(GzippedInputStream.java:335)
>   org.archive.io.GzippedInputStream.gzipMemberSeek(GzippedInputStream.java:370)
>   org.archive.io.arc.ARCReaderFactory$CompressedARCReader.get(ARCReaderFactory.java:383)
>   org.archive.io.arc.ARCReaderFactory$CompressedARCReader.get(ARCReaderFactory.java:326)
>   org.archive.wayback.resourcestore.LocalARCResourceStore.retrieveResource(LocalARCResourceStore.java:108)
>   org.archive.wayback.webapp.AccessPoint.handleReplay(AccessPoint.java:312)
>   org.archive.wayback.webapp.AccessPoint.handleRequest(AccessPoint.java:280)
>   org.archive.wayback.webapp.RequestFilter.handle(RequestFilter.java:106)
>   org.archive.wayback.webapp.RequestFilter.doFilter(RequestFilter.java:90)
>
> The wayback finds the file and then checks if it is OK. This check
> throws a NoGzipMagicException because it doesn't find a "magic" number.
> The code used is in commons-2.0.0-SNAPSHOT-sources.jar (from Heritrix)
> for both projects, nutchwax and wayback.
>
> I also installed nutchwax 0.11.0-SNAPSHOT from svn (both projects from
> trunk) and indexed the same ARC files. The query results are presented
> OK. Other files present the same symptoms. Does anyone have a clue about
> this problem? Does anyone use this version of wayback without problems?
>
> Thanks
> --
> Miguel Costa
From: Brad T. <br...@ar...> - 2008-01-30 04:12:31
Hi Miguel,

The SVN code was not quite coherent: the wayback.xml configuration file, specifically, was not referencing the new implementation classes for the ResourceStore. I'm hoping that this is the issue, but see below for some more notes if you're still having problems. The major changes you'll need to make in the wayback.xml are in the ResourceStore and Replay configurations.

I'm not convinced this will solve the problem though, since you were able to index the documents OK. With what version of the wayback code did you first index them?

One last question is how the ARCs were compressed. Were they written compressed by Heritrix, or compressed later?

If the new wayback.xml (using different implementations) does not fix the problem, one thing that may help me figure out what's going wrong would be a fragment of one of your ARC files. Can you post part of one of your ARC files somewhere, for example, just the first few hundred KB? (head -c 200000 foo.arc.gz > sample.arc.gz -- understanding that the last record in the ARC fragment will probably be truncated.)

Brad

Miguel Costa wrote:
> Hello,
>
> I installed wayback 1.1.0-SNAPSHOT from svn. When I query the wayback
> with a URL I get a:
>
> org.archive.io.NoGzipMagicException
>   org.archive.io.GzipHeader.readHeader(GzipHeader.java:122)
>   org.archive.io.GzipHeader.<init>(GzipHeader.java:107)
>   org.archive.io.GzippedInputStream.readHeader(GzippedInputStream.java:335)
>   org.archive.io.GzippedInputStream.gzipMemberSeek(GzippedInputStream.java:370)
>   org.archive.io.arc.ARCReaderFactory$CompressedARCReader.get(ARCReaderFactory.java:383)
>   org.archive.io.arc.ARCReaderFactory$CompressedARCReader.get(ARCReaderFactory.java:326)
>   org.archive.wayback.resourcestore.LocalARCResourceStore.retrieveResource(LocalARCResourceStore.java:108)
>   org.archive.wayback.webapp.AccessPoint.handleReplay(AccessPoint.java:312)
>   org.archive.wayback.webapp.AccessPoint.handleRequest(AccessPoint.java:280)
>   org.archive.wayback.webapp.RequestFilter.handle(RequestFilter.java:106)
>   org.archive.wayback.webapp.RequestFilter.doFilter(RequestFilter.java:90)
>
> The wayback finds the file and then checks if it is OK. This check
> throws a NoGzipMagicException because it doesn't find a "magic" number.
> The code used is in commons-2.0.0-SNAPSHOT-sources.jar (from Heritrix)
> for both projects, nutchwax and wayback.
>
> I also installed nutchwax 0.11.0-SNAPSHOT from svn (both projects from
> trunk) and indexed the same ARC files. The query results are presented
> OK. Other files present the same symptoms. Does anyone have a clue about
> this problem? Does anyone use this version of wayback without problems?
>
> Thanks
> --
> Miguel Costa
From: Brad T. <br...@ar...> - 2008-01-30 03:53:50
Hi Natalia,

Sorry about the delayed response. I'm not sure I understand your question/problem. Are you trying to run multiple tomcat instances (processes) using the same BDB index? In this case, I don't think it will work: the BDBJE code wants only a single process to write to an environment, and currently the wayback always tries to open the BDB read-write. If this is something that really needs to happen, we might be able to change the way the software opens the databases, but I'm not sure if BDBJE supports this.

You may be able to solve this problem using multiple AccessPoints, using wayback 1.0 or later. Can you elaborate on how you're trying to set up the wayback in this deployment?

Brad

Natalia Torres wrote:
> Hello
>
> I installed wayback 0.8 following the instructions on the web page,
> customizing it to use the wayback machine in timeline access mode.
>
> When I upgraded to use 2 tomcat v5.5 instances in a cluster, wayback
> doesn't work. I get this message from the log file:
>
> INFO: new org.archive.wayback.resourceindex.LocalResourceIndex created.
> org.archive.wayback.exception.ConfigurationException: A je.lck file
> exists in /mywaybackdata/index. The environment cannot be locked for
> single writer access.
>
> When I search in the UI I get this message:
>
> Configuration Error
>
> A je.lck file exists in /paditest/wayback/index. The environment cannot
> be locked for single writer access.
>
> Does each tomcat need its own indexes? Can't they share them? Is there
> any setting for this in the configuration files?
>
> Thanks
>
> N.
From: Miguel C. <mig...@fc...> - 2008-01-29 16:06:13
Hello,

I installed wayback 1.1.0-SNAPSHOT from svn. When I query the wayback with a URL I get a:

org.archive.io.NoGzipMagicException
  org.archive.io.GzipHeader.readHeader(GzipHeader.java:122)
  org.archive.io.GzipHeader.<init>(GzipHeader.java:107)
  org.archive.io.GzippedInputStream.readHeader(GzippedInputStream.java:335)
  org.archive.io.GzippedInputStream.gzipMemberSeek(GzippedInputStream.java:370)
  org.archive.io.arc.ARCReaderFactory$CompressedARCReader.get(ARCReaderFactory.java:383)
  org.archive.io.arc.ARCReaderFactory$CompressedARCReader.get(ARCReaderFactory.java:326)
  org.archive.wayback.resourcestore.LocalARCResourceStore.retrieveResource(LocalARCResourceStore.java:108)
  org.archive.wayback.webapp.AccessPoint.handleReplay(AccessPoint.java:312)
  org.archive.wayback.webapp.AccessPoint.handleRequest(AccessPoint.java:280)
  org.archive.wayback.webapp.RequestFilter.handle(RequestFilter.java:106)
  org.archive.wayback.webapp.RequestFilter.doFilter(RequestFilter.java:90)

The wayback finds the file and then checks if it is OK. This check throws a NoGzipMagicException because it doesn't find a "magic" number. The code used is in commons-2.0.0-SNAPSHOT-sources.jar (from Heritrix) for both projects, nutchwax and wayback.

I also installed nutchwax 0.11.0-SNAPSHOT from svn (both projects from trunk) and indexed the same ARC files. The query results are presented OK. Other files present the same symptoms. Does anyone have a clue about this problem? Does anyone use this version of wayback without problems?

Thanks
--
Miguel Costa
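For context: a NoGzipMagicException means the ARC reader did not see the gzip magic bytes (0x1f 0x8b) where a compressed record should start. A quick sanity check on a suspect file, sketched here against a freshly made gzip file rather than a real ARC (substitute one of your .arc.gz files):

```shell
# Make a tiny gzip file to demonstrate with
echo "hello" | gzip > sample.gz

# Every gzip member begins with the two magic bytes 1f 8b
od -An -tx1 -N2 sample.gz

# gzip -t walks the compressed stream and reports structural damage
gzip -t sample.gz && echo "gzip structure OK"
```

Note that a multi-member .arc.gz that is damaged mid-file will still pass the first-bytes check but should fail `gzip -t` at the broken member.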
From: Natalia T. <nt...@ce...> - 2008-01-18 09:11:12
Hello

I installed wayback 0.8 following the instructions on the web page, customizing it to use the wayback machine in timeline access mode.

When I upgraded to use 2 tomcat v5.5 instances in a cluster, wayback doesn't work. I get this message from the log file:

INFO: new org.archive.wayback.resourceindex.LocalResourceIndex created.
org.archive.wayback.exception.ConfigurationException: A je.lck file exists in /mywaybackdata/index. The environment cannot be locked for single writer access.

When I search in the UI I get this message:

Configuration Error

A je.lck file exists in /paditest/wayback/index. The environment cannot be locked for single writer access.

Does each tomcat need its own indexes? Can't they share them? Is there any setting for this in the configuration files?

Thanks

N.
From: Erik H. <eri...@uc...> - 2008-01-11 16:09:46
At Fri, 11 Jan 2008 16:05:58 -0000, "Pope, Jackson" <Jac...@bl...> wrote:
>
> Hiya Erik,
>
> Thanks for all your help. I'd already got it working - I needed to make
> the arc files accessible via Apache - that fixed all the problems.

Best of luck with your project!

;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3
From: Pope, J. <Jac...@bl...> - 2008-01-11 16:06:02
Hiya Erik,

Thanks for all your help. I'd already got it working - I needed to make the arc files accessible via Apache - that fixed all the problems.

Cheers,

Jack

Jackson Pope
Technical Lead
Web Archiving Team
The British Library
+44 (0)1937 54 6942

-----Original Message-----
From: Erik Hetzner [mailto:eri...@uc...]
Sent: 11 January 2008 16:02
To: arc...@li...
Cc: Pope, Jackson
Subject: Re: [Archive-access-discuss] FW: Wayback 0.8.0 and ArcProxy

At Fri, 11 Jan 2008 11:10:41 -0000, "Pope, Jackson" <Jac...@bl...> wrote:
>
> Hiya Erik,
>
> Thanks for your help. I've now got an ArcProxy running (by copying
> wayback.war to arc-proxy.war, deploying the proxy and changing its
> config so only the ArcProxy/LocationDB section is not commented out),
> but I still can't see the files.
>
> I've run curl as you suggested and it appears to work (urls munged below):
>
> curl http://www.example.com:8080/arc-proxy/locationDB -d operation=add
>   -d name=/wap/filestore/2523139/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz
>   -d url=http://www.example.com:8080/arc-proxy/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz
>
> OK added url http://www.example.com:8080/arc-proxy/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz
> for /wap/filestore/2523139/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz
>
> Yet when I try to browse the wayback for a URL in that arc I get an
> error saying the resource is unavailable, with the following error in
> catalina.out:
>
> INFO: initialized org.archive.wayback.resourcestore.http.FileLocationDB
> com.sleepycat.je.DatabaseException: Unable to
> locate(IAH-20070920101741-00000-wap300.bl.uk.arc.gz)
> [...]
>
> It's not a problem with the arc file, as this was fine when I was using
> a local ARC store.
>
> Any ideas?

Hi Jack. I think that the problem might be that you are using the full path as the name parameter. Can you try using just the ARC name as the name parameter instead? For example:

curl http://www.example.com:8080/arc-proxy/locationDB -d operation=add -d name=IAH-20070920101741-00000-wap300.bl.uk.arc.gz -d url=http://www.example.com:8080/arc-proxy/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz

I believe that name is what arc-proxy uses as the lookup key, so it makes sense that the proxy would be unable to locate 'IAH-20070920101741-00000-wap300.bl.uk.arc.gz': it knows that ARC as /wap/filestore/...

best, Erik Hetzner

;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3
From: Erik H. <eri...@uc...> - 2008-01-11 15:59:43
At Fri, 11 Jan 2008 11:10:41 -0000, "Pope, Jackson" <Jac...@bl...> wrote:
>
> Hiya Erik,
>
> Thanks for your help. I've now got an ArcProxy running (by copying
> wayback.war to arc-proxy.war, deploying the proxy and changing its
> config so only the ArcProxy/LocationDB section is not commented out),
> but I still can't see the files.
>
> I've run curl as you suggested and it appears to work (urls munged below):
>
> curl http://www.example.com:8080/arc-proxy/locationDB -d operation=add
>   -d name=/wap/filestore/2523139/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz
>   -d url=http://www.example.com:8080/arc-proxy/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz
>
> OK added url http://www.example.com:8080/arc-proxy/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz
> for /wap/filestore/2523139/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz
>
> Yet when I try to browse the wayback for a URL in that arc I get an
> error saying the resource is unavailable, with the following error in
> catalina.out:
>
> INFO: initialized org.archive.wayback.resourcestore.http.FileLocationDB
> com.sleepycat.je.DatabaseException: Unable to
> locate(IAH-20070920101741-00000-wap300.bl.uk.arc.gz)
> [...]
>
> It's not a problem with the arc file, as this was fine when I was using
> a local ARC store.
>
> Any ideas?

Hi Jack. I think that the problem might be that you are using the full path as the name parameter. Can you try using just the ARC name as the name parameter instead? For example:

curl http://www.example.com:8080/arc-proxy/locationDB -d operation=add -d name=IAH-20070920101741-00000-wap300.bl.uk.arc.gz -d url=http://www.example.com:8080/arc-proxy/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz

I believe that name is what arc-proxy uses as the lookup key, so it makes sense that the proxy would be unable to locate 'IAH-20070920101741-00000-wap300.bl.uk.arc.gz': it knows that ARC as /wap/filestore/...

best, Erik Hetzner

;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3
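The correction above, restated as a shell sketch: register the ARC with the arc-proxy locationDB under its basename, not its full filesystem path. The hostname and paths are the illustrative ones from this thread, and the curl command is echoed rather than executed.

```shell
# Full path where the ARC lives on disk (illustrative, from this thread)
ARC=/wap/filestore/2523139/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz

# The locationDB lookup key should be the bare ARC file name
NAME=$(basename "$ARC")

# Dry run: print the registration request instead of sending it
echo curl http://www.example.com:8080/arc-proxy/locationDB \
  -d operation=add \
  -d "name=$NAME" \
  -d "url=http://www.example.com:8080/arc-proxy/arcs/$NAME"
```

Scripting the registration this way over a list of ARC paths keeps the key and the served URL consistent, which is exactly where the "Unable to locate" lookup failure came from.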
From: Daniel G. <dan...@fc...> - 2008-01-11 12:46:16
|
Alex Osborne wrote: > Hi Daniel, > > Daniel Gomes <dan...@fc...> writes: > > Hi Alex. Thank you very much for your comments. Check my answers bellow. >> The main idea is to enable Internet users to provide storage space >> from their computers to replicate a relatively small part of the >> archived data. >> > > Have you considered any incentives for users to contribute to the > system? While distributed computation projects (SETI@home, Folding@home > etc) offer some sort of bragging rights about how much data you've > processed, they make use of an idle resource which can never "run out" > (forgetting the electricity bill). With backup storage, once you've > filled up a disk, that's it, you can't contribute any more. Also > "curing cancer" and "advancing science" sound a lot more charitable than > "storing backup copies of old websites". ;-) > > That's a very good point. How to market the project? For now we are more concerned on having the system working properly but it is a question that we will definitely have to address in the future. We have some ideas. We hope that our web archive site will become popular, at least in Portugal, and we intend to have a list of the contributers that provide disk space for the project, presenting highlight links to the sites of the top contributers on the home page of the site. Companies may have commercial interest in having a link to their sites coming from a popular site, national institutions may have interest in showing that they are contributing to preserve national historical contents (not old sites :-)) at the worst scenario we hope that individuals can brag from being contributing for the project. We hope that competition for the top links will motivate users to provide more disk space. We also hope that within the web archiving community, institutions will provide disk space to replicate other web archives. 
For instance, on our project machines in Portugal we could install an rARC client for the Pandora web archive and provide space to replicate Australian web contents. Other European web archiving initiatives could do the same, and this way at least some Australian web contents would be replicated at different geographical locations and preserved even in case of a catastrophe that damaged the Pandora servers (we hope this will never happen). In return you can do the same for us (or not), and keep copies of our Portuguese contents. We believe that countries that share the same language will feel more encouraged to replicate each other's archives.

We could also randomly present on the site information about a contributor to the project, something like "Contributor of the week", so that even small contributors could brag. The contributors would be notified by email that this week they were elected to be presented on the site. People can choose not to be elected. Anyway, I agree that it is harder to convince people to give disk space than CPU time. Any more ideas would be most welcome.

> An alternative model which is a bit fairer to users is a peer-to-peer
> distributed backup system, where users trade their local storage space
> in return for having their own files backed up by the community. Thus a
> web archiving institution would be just another user.
>
> However, while there's plenty of discussion and academic papers on
> peer-to-peer backup I wasn't able to find any projects that have really
> taken off outside the traditional realms of file-sharing (Bittorrent,
> Gnutella etc) and anonymity (Freenet). There seems to be just a couple
> of research projects and a hobbyist one in early stages of development,
> which is a pity.
>
> http://flud.org/
> http://myriadstore.sics.se/
> http://oceanstore.cs.berkeley.edu/

I will take a deeper look at these projects. I only knew Oceanstore. 
At first sight, our project has similarities with Lockss (http://www.lockss.org/lockss/How_It_Works). However, Lockss is targeted at having libraries as storage nodes, not arbitrary Internet users, and at preserving web publications, not general pages.

General P2P systems are built on several assumptions/requirements that are not applicable in the web archive context. These are general remarks; individual P2P systems differ among themselves.

1. Contributors want to be anonymous (most of the contents shared are illegal).
2. Contributors want to have access to the contents.
3. The systems are designed to quickly share information, not to preserve it.
4. There is not a single source of the information to replicate, as there is in a web archive.
5. There is no need to retrieve information from all the storage nodes into a single location, as there is if we want to rebuild a web archive.
6. It is assumed that storage nodes provide small amounts of disk space. We hope that some contributors will provide a considerable amount of disk space.
7. Most P2P systems were developed some time ago, when people used them mostly to share small files such as MP3s. With videos it is different; that is one reason why Bittorrent gained popularity over existing P2P systems. Web archive files such as ARC files are relatively big.

Nonetheless, P2P systems could be adapted to provide replication across a controlled set of nodes composed of web archives. Other web archivists, please feel free to join this discussion. Your comments will be most welcome.

Best regards,
/Daniel Gomes

> Cheers,
>
> Alex

-- /Daniel Gomes FCCN Av. do Brasil, n.º 101 1700-066 Lisboa Tel.: +351 21 8440190 Fax: +351 218472167 www.fccn.pt

This message is intended exclusively for its addressee. It may contain CONFIDENTIAL information protected by law. If this message has been received by error, please notify us via e-mail or by telephone +351 218440100 and delete it immediately. |
|
From: Pope, J. <Jac...@bl...> - 2008-01-11 11:10:43
|
Hiya Erik,

Thanks for your help. I've now got an ArcProxy running (by copying wayback.war to arc-proxy.war, deploying the proxy and changing its config so only the ArcProxy/LocationDB section is not commented out), but I still can't see the files. I've run curl as you suggested and it appears to work (urls munged below):

curl http://www.example.com:8080/arc-proxy/locationDB -d operation=add -d name=/wap/filestore/2523139/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz -d url=http://www.example.com:8080/arc-proxy/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz

OK added url http://www.example.com:8080/arc-proxy/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz for /wap/filestore/2523139/arcs/IAH-20070920101741-00000-wap300.bl.uk.arc.gz

Yet when I try to browse the wayback for a URL in that arc I get an error saying the resource is unavailable, with the following error in catalina.out:

INFO: initialized org.archive.wayback.resourcestore.http.FileLocationDB
com.sleepycat.je.DatabaseException: Unable to locate(IAH-20070920101741-00000-wap300.bl.uk.arc.gz)
        at org.archive.wayback.resourcestore.http.ArcProxyServlet.doGet(ArcProxyServlet.java:90)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:263)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:584)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
        at java.lang.Thread.run(Thread.java:619)

It's not a problem with the arc file, as this was fine when I was using a local ARC store. Any ideas?

Cheers,

Jack

Jackson Pope
Technical Lead Web Archiving Team
The British Library
+44 (0)1937 54 6942

-----Original Message-----
From: Erik Hetzner [mailto:eri...@uc...]
Sent: 11 January 2008 01:43
To: arc...@li...
Cc: Pope, Jackson
Subject: Re: [Archive-access-discuss] FW: Wayback 0.8.0 and ArcProxy

Hi Jackson,

At Thu, 10 Jan 2008 13:29:05 -0000, "Pope, Jackson" <Jac...@bl...> wrote:
> Hiya,
>
> Are there any instructions on how to setup an ArcProxy for use with
> Wayback 0.8.0?

Unfortunately the docs on the web site are for the 1.0 series of Wayback. It's been a while since I set up an arc proxy for the 0.8 series but I'll do my best.

> I've got wayback installed and working with Nutchwax, and
> now I'm trying to get the arcs proxied rather than use the
> LocalRestoreStore. I want the ArcProxy setup on the same machine, and
> the files presented via HTTP on the same machine too (though they are
> stored on an NFS server).

If you only have one set of files served via HTTP, I am not sure you need the proxy. You should be able to get away with just an HTTP resource store; the proxy is only necessary if you have many HTTP servers serving ARC content, and you need a central location to keep track of where ARCs are located.

> I've uncommented the appropriate section of the web.xml, and tried
> running location-client, but I'm not sure I've got an ArcProxy running
> (is it a separate download or part of Wayback?), and I don't know what
> setting to use in the location-client calls or the Remote HTTP1.1
> Resource Store, to get this setup correctly. Is there a document kicking
> around that explains this?

If you have got an arc proxy working correctly, and you have added to it with the location client, you should be able to do a GET on http://example.org/proxy-prefix/IAH-20070705232355-00000-example.org.arc.gz for a known ARC to get it back.

I find it easier to use the following curl command to add ARCs to the arc proxy than using the location-client:

curl ${LOCATIONDB_URL} -d operation=add -d name=${F} -d url=${BASE_URL}${F}

where LOCATIONDB_URL is the arc proxy URL, F is the name of the arc file, and BASE_URL is the base url of the HTTP server where you are serving arc files from. Hope that helps.

best, Erik Hetzner ;; Erik Hetzner, California Digital Library ;; gnupg key id: 1024D/01DB07E3

**************************************************************************
Experience the British Library online at www.bl.uk
The British Library's new interactive Annual Report and Accounts 2006/07: www.bl.uk/mylibrary
Help the British Library conserve the world's knowledge. Adopt a Book. www.bl.uk/adoptabook
The Library's St Pancras site is WiFi-enabled
**************************************************************************
The information contained in this e-mail is confidential and may be legally privileged. It is intended for the addressee(s) only. If you are not the intended recipient, please delete this e-mail and notify the pos...@bl...: The contents of this e-mail must not be disclosed or copied without the sender's consent.
The statements and opinions expressed in this message are those of the author and do not necessarily reflect those of the British Library. The British Library does not take any responsibility for the views of the author.
**************************************************************************
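[Editor's note] The curl registration step discussed in this thread can be scripted over a whole list of ARCs. A minimal sketch follows; the host, port, and file names are placeholders rather than values from the thread, and the commands are only echoed so they can be reviewed before being run against a live arc-proxy.

```shell
# Batch registration sketch for the arc-proxy locationDB, following the
# curl form used in this thread. All URLs and file names are placeholders.
# Commands are echoed rather than executed, for review before use.
LOCATIONDB_URL="http://localhost:8080/arc-proxy/locationDB"
BASE_URL="http://localhost:8080/arc-proxy/arcs/"
for F in IAH-20070705232355-00000-example.org.arc.gz \
         IAH-20070705232356-00001-example.org.arc.gz; do
  echo curl "$LOCATIONDB_URL" -d operation=add -d "name=$F" -d "url=${BASE_URL}${F}"
done
```

Dropping the `echo` would perform the actual registrations.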
|
From: Erik H. <eri...@uc...> - 2008-01-11 01:41:45
|
Hi Jackson,

At Thu, 10 Jan 2008 13:29:05 -0000, "Pope, Jackson" <Jac...@bl...> wrote:
> Hiya,
>
> Are there any instructions on how to setup an ArcProxy for use with
> Wayback 0.8.0?

Unfortunately the docs on the web site are for the 1.0 series of Wayback. It's been a while since I set up an arc proxy for the 0.8 series but I'll do my best.

> I've got wayback installed and working with Nutchwax, and
> now I'm trying to get the arcs proxied rather than use the
> LocalRestoreStore. I want the ArcProxy setup on the same machine, and
> the files presented via HTTP on the same machine too (though they are
> stored on an NFS server).

If you only have one set of files served via HTTP, I am not sure you need the proxy. You should be able to get away with just an HTTP resource store; the proxy is only necessary if you have many HTTP servers serving ARC content, and you need a central location to keep track of where ARCs are located.

> I've uncommented the appropriate section of the web.xml, and tried
> running location-client, but I'm not sure I've got an ArcProxy running
> (is it a separate download or part of Wayback?), and I don't know what
> setting to use in the location-client calls or the Remote HTTP1.1
> Resource Store, to get this setup correctly. Is there a document kicking
> around that explains this?

If you have got an arc proxy working correctly, and you have added to it with the location client, you should be able to do a GET on http://example.org/proxy-prefix/IAH-20070705232355-00000-example.org.arc.gz for a known ARC to get it back.

I find it easier to use the following curl command to add ARCs to the arc proxy than using the location-client:

curl ${LOCATIONDB_URL} -d operation=add -d name=${F} -d url=${BASE_URL}${F}

where LOCATIONDB_URL is the arc proxy URL, F is the name of the arc file, and BASE_URL is the base url of the HTTP server where you are serving arc files from. Hope that helps. 
best, Erik Hetzner ;; Erik Hetzner, California Digital Library ;; gnupg key id: 1024D/01DB07E3 |
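[Editor's note] Erik's "do a GET on the proxy" check can be wrapped in a tiny helper that reports only the HTTP status code. This is a sketch: the example host and file name are placeholders, it assumes `curl` is available, and it returns a real status only against a running arc-proxy (a dead host yields `000`).

```shell
# Minimal retrieval check: a registered ARC should come back from the
# proxy with a plain GET. Prints the HTTP status code only.
check_arc() {
  # $1 = proxy base URL, $2 = ARC file name (placeholders in the example)
  curl -s -o /dev/null -w '%{http_code}' "$1/$2"
}
# Example invocation (placeholder host; needs a live arc-proxy):
#   check_arc http://localhost:8080/arc-proxy/arcs IAH-20070705232355-00000-example.org.arc.gz
```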
|
From: Pope, J. <Jac...@bl...> - 2008-01-10 13:29:03
|
Hiya,

Are there any instructions on how to setup an ArcProxy for use with Wayback 0.8.0? I've got wayback installed and working with Nutchwax, and now I'm trying to get the arcs proxied rather than use the LocalRestoreStore. I want the ArcProxy setup on the same machine, and the files presented via HTTP on the same machine too (though they are stored on an NFS server).

I've uncommented the appropriate section of the web.xml, and tried running location-client, but I'm not sure I've got an ArcProxy running (is it a separate download or part of Wayback?), and I don't know what setting to use in the location-client calls or the Remote HTTP1.1 Resource Store, to get this setup correctly. Is there a document kicking around that explains this?

Cheers,

Jack

Jackson Pope
Technical Lead Web Archiving Team
The British Library
+44 (0)1937 54 6942
|
From: Gina J. <gj...@lo...> - 2008-01-09 21:26:21
|
We are putting together specifications for installation here at the Library. On the Wayback page, it says it was tested with 5.5.

Current tomcat version is 6.0. Do we need to specify tomcat version 5.5, or can we specify tomcat version 5.5 or higher?

thanks, gina

Gina Jones, gj...@lo...
Digital Media Project Coordinator
Office of Strategic Initiatives
Library of Congress
http://www.loc.gov/webcapture
1-202-707-6604
|
From: Alex O. <aos...@nl...> - 2008-01-09 03:43:52
|
Hi Daniel,

Daniel Gomes <dan...@fc...> writes:
> The main idea is to enable Internet users to provide storage space
> from their computers to replicate a relatively small part of the
> archived data.

Have you considered any incentives for users to contribute to the system? While distributed computation projects (SETI@home, Folding@home etc) offer some sort of bragging rights about how much data you've processed, they make use of an idle resource which can never "run out" (forgetting the electricity bill). With backup storage, once you've filled up a disk, that's it, you can't contribute any more. Also "curing cancer" and "advancing science" sound a lot more charitable than "storing backup copies of old websites". ;-)

An alternative model which is a bit fairer to users is a peer-to-peer distributed backup system, where users trade their local storage space in return for having their own files backed up by the community. Thus a web archiving institution would be just another user.

However, while there's plenty of discussion and academic papers on peer-to-peer backup, I wasn't able to find any projects that have really taken off outside the traditional realms of file-sharing (Bittorrent, Gnutella etc) and anonymity (Freenet). There seem to be just a couple of research projects and a hobbyist one in early stages of development, which is a pity.

http://flud.org/
http://myriadstore.sics.se/
http://oceanstore.cs.berkeley.edu/

Cheers,

Alex |
|
From: Daniel G. <dan...@fc...> - 2008-01-07 11:56:35
|
Dear web archivers,

Portugal is now beginning its national web archiving initiative with project Tomba at FCCN (National Foundation for Scientific Computing). Tomba aims to create a national web archive system using Archive-access tools and to contribute to this project with enhancements and new tools.

The first contribution we intend to make to the Archive-access project is the development of a distributed system which enables the replication of the archive files kept in a repository (ARC files) across several storage nodes on the Internet. The main idea is to enable Internet users to provide storage space from their computers to replicate a relatively small part of the archived data. Ideally, every ARC file kept in the central repository would have several replicas stored across the Internet. This system was named rARC (ARC replicator).

I am sending a short description of the project as an attachment. We would deeply appreciate your comments.

Best regards,

-- /Daniel Gomes FCCN Av. do Brasil, n.º 101 1700-066 Lisboa Tel.: +351 21 8440190 Fax: +351 218472167 www.fccn.pt |
|
From: Natalia T. <nt...@ce...> - 2008-01-07 09:52:34
|
Hi Mathew,
I reviewed web.xml and both paths point to the same directory...
The incoming path in resourceIndex
<param-name>resourceindex.incomingpath</param-name>
<param-value>/wayback/index-data/incoming</param-value>
and the target in resourceStore
<param-name>resourcestore.indextarget</param-name>
<param-value>/wayback/index-data/incoming</param-value>
Any other ideas?
Here is the full web.xml
<?xml version="1.0"?>
<!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application
2.3//EN"
"http://java.sun.com/dtd/web-app_2_3.dtd">
<web-app>
<!-- General Installation information
-->
<context-param>
<param-name>installationname</param-name>
<param-value>General Configuration</param-value>
<description>
This text will appear on the Wayback Configuration and
Status page
and may assist in determining which installation users are
viewing
via their web browser in environments with multiple Wayback
installations.
</description>
</context-param>
<listener>
<listener-class>org.archive.wayback.core.WaybackContextListener</listener-class>
</listener>
<!-- START OF Timeline UI OPTIONS
This section contains configuration for using the wayback machine in
timeline
access mode, similar to the WERA application.
These options are not used by default.
-->
<servlet>
<servlet-name>XMLQueryServlet</servlet-name>
<servlet-class>org.archive.wayback.query.QueryServlet</servlet-class>
<init-param>
<param-name>queryui.jsppath</param-name>
<param-value>jsp/QueryXMLUI</param-value>
</init-param>
</servlet>
<servlet-mapping>
<servlet-name>XMLQueryServlet</servlet-name>
<url-pattern>/xmlquery</url-pattern>
</servlet-mapping>
<servlet>
<servlet-name>QueryServlet</servlet-name>
<servlet-class>org.archive.wayback.query.QueryServlet</servlet-class>
<init-param>
<param-name>queryui.jsppath</param-name>
<param-value>jsp/QueryUI</param-value>
</init-param>
</servlet>
<servlet-mapping>
<servlet-name>QueryServlet</servlet-name>
<url-pattern>/query</url-pattern>
</servlet-mapping>
<servlet>
<servlet-name>TimelineQueryServlet</servlet-name>
<servlet-class>org.archive.wayback.query.QueryServlet</servlet-class>
<init-param>
<param-name>queryui.jsppath</param-name>
<param-value>jsp/TimelineUI</param-value>
</init-param>
</servlet>
<servlet-mapping>
<servlet-name>TimelineQueryServlet</servlet-name>
<url-pattern>/timeline</url-pattern>
</servlet-mapping>
<servlet>
<servlet-name>FramesetReplayServlet</servlet-name>
<servlet-class>org.archive.wayback.replay.ReplayServlet</servlet-class>
<init-param>
<param-name>replayrenderer.classname</param-name>
<param-value>org.archive.wayback.timeline.FramesetReplayRenderer</param-value>
<description>Implementation responsible for drawing
replayed resources and replay error messages</description>
</init-param>
</servlet>
<servlet-mapping>
<servlet-name>FramesetReplayServlet</servlet-name>
<url-pattern>/frameset</url-pattern>
</servlet-mapping>
<servlet>
<servlet-name>InlineReplayServlet</servlet-name>
<servlet-class>org.archive.wayback.replay.ReplayServlet</servlet-class>
<init-param>
<param-name>replayrenderer.classname</param-name>
<param-value>org.archive.wayback.timeline.TimelineReplayRenderer</param-value>
<description>Implementation responsible for drawing
replayed resources and replay error messages</description>
</init-param>
</servlet>
<servlet-mapping>
<servlet-name>InlineReplayServlet</servlet-name>
<url-pattern>/replay</url-pattern>
</servlet-mapping>
<servlet>
<servlet-name>MetaReplayServlet</servlet-name>
<servlet-class>org.archive.wayback.replay.ReplayServlet</servlet-class>
<init-param>
<param-name>replayrenderer.classname</param-name>
<param-value>org.archive.wayback.timeline.MetaReplayRenderer</param-value>
<description>Implementation responsible for drawing
replayed resources and replay error messages</description>
</init-param>
</servlet>
<servlet-mapping>
<servlet-name>MetaReplayServlet</servlet-name>
<url-pattern>/meta</url-pattern>
</servlet-mapping>
<context-param>
<param-name>replayui.jsppath</param-name>
<param-value>jsp/ReplayUI</param-value>
<description>ReplayUI specific path to jsp pages. relative to
webapp/</description>
</context-param>
<context-param>
<param-name>queryrenderer.classname</param-name>
<param-value>org.archive.wayback.timeline.TimelineQueryRenderer</param-value>
<description>Implementation responsible for drawing Index Query
results</description>
</context-param>
<context-param>
<param-name>replayuriconverter.classname</param-name>
<param-value>org.archive.wayback.timeline.TimelineReplayResultURIConverter</param-value>
<description>Class that implements translation of index results
to Replayable URIs for this Wayback</description>
</context-param>
<context-param>
<param-name>jsuri</param-name>
<param-value>http://recercat.test.cesca.es/wayback/jsp/TimelineUI/wm-timeline-text.js,http://recercat.test.cesca.es/wayback/jsp/TimelineUI/wm-timeline.js</param-value>
<description>HTTP URI to javascript files</description>
</context-param>
<context-param>
<param-name>replayuriprefix</param-name>
<param-value>http://recercat.test.cesca.es/wayback/replay</param-value>
<description>HTTP URI prefix for the replay servlet</description>
</context-param>
<context-param>
<param-name>metauriprefix</param-name>
<param-value>http://recercat.test.cesca.es/wayback/meta</param-value>
<description>HTTP URI prefix for the meta replay
servlet</description>
</context-param>
<context-param>
<param-name>timelineuriprefix</param-name>
<param-value>http://recercat.test.cesca.es/wayback/timeline</param-value>
<description>HTTP URI prefix for the timeline servlet</description>
</context-param>
<context-param>
<param-name>frameseturiprefix</param-name>
<param-value>http://recercat.test.cesca.es/wayback/frameset</param-value>
<description>HTTP URI prefix for the frameset servlet</description>
</context-param>
<!-- END OF Timeline UI OPTIONS -->
<!-- START OF Local-ARC ResourceStore OPTIONS
This section contains configuration for accessing ARC files from a single
directory on a local filesystem. If ARC files are spread across multiple
local directories, a single directory be created, and populated with
symbolic
links to the various locations of the ARC files. This configuration
section also
contains specific configuration for an indexing thread, which can optionally
notice new ARC files, generate CDX flat files for new ARCs, and hand off
these
CDX files to a BDB resource index for merging.
-->
<context-param>
<param-name>resourcestore.classname</param-name>
<param-value>org.archive.wayback.resourcestore.LocalARCResourceStore</param-value>
<description>Class that implements ResourceStore for this
Wayback</description>
</context-param>
<context-param>
<param-name>resourcestore.arcpath</param-name>
<param-value>/dades/arcs</param-value>
<description>
Directory where ARC files are found (possibly where
Heritrix writes them.)
This directory must exist.
</description>
</context-param>
<context-param>
<param-name>resourcestore.autoindex</param-name>
<param-value>1</param-value>
<description>
If this is set to '1', then a background thread is launched
that
detects new ARC files appearing in arcpath. New ARCs are
indexed,
and a CDX flat file, with one line per ARC Record is
created, one
CDX file per ARC. These CDX files are then handed off to
the index
for incorporation into the index.
</description>
</context-param>
<context-param>
<param-name>resourcestore.tmppath</param-name>
<param-value>/wayback/arc-indexer/tmp</param-value>
<description>
Directory where CDX files are created temporarily. This is a
scratch space directory, which must exist.
</description>
</context-param>
<context-param>
<param-name>resourcestore.workpath</param-name>
<param-value>/wayback/arc-indexer/work</param-value>
<description>
Directory which holds empty flag files indicating that ARC
files
are waiting to be indexed.
This directory must exist.
</description>
</context-param>
<context-param>
<param-name>resourcestore.queuedpath</param-name>
<param-value>/wayback/arc-indexer/queued</param-value>
<description>
Directory which holds empty flag files indicating that ARC
files
have already been seen and queued for indexing.
This directory must exist.
</description>
</context-param>
<context-param>
<param-name>resourcestore.indextarget</param-name>
<param-value>/wayback/index-data/incoming</param-value>
<description>
Directory or URL where CDX files are sent after they are
created. If
the value of this parameter begins with http://, then the
value is
assumed to be a URL where CDX files are PUT, on a possibly
remote
resourceindex node. If the value does not begin with
http://, then
the value is assumed to be a local directory, which must
exist,
where completed CDX files are moved for incorporation into the
index.
</description>
</context-param>
<context-param>
<param-name>resourcestore.indexinterval</param-name>
<param-value>1000</param-value>
<description>
Millisecond interval between checks for new ARCs that need
to be
processed. This is only the initial time slept when first
starting
up, and after any new files are found. Each interval that
no new
ARCs are detected, the duration slept increases by this amount.
</description>
</context-param>
<!-- END OF Local-ARC ResourceStore OPTIONS -->
<!-- START OF Local-BDB ResourceIndex OPTIONS
This section contains configuration for using a BDB JE to hold the document
index on the local filesystem. This section also contains configuration for
an optional index update thread, which will scan a directory for new
index data,
in CDX format, and will automatically add new index records to the
index.This
is the default index storage implementation.
-->
<filter>
<filter-name>RemoteSubmitFilter</filter-name>
<filter-class>org.archive.wayback.resourceindex.indexer.RemoteSubmitFilter</filter-class>
<init-param>
<param-name>pipeline.statusjsp</param-name>
<param-value>jsp/PipelineUI/PipelineStatus.jsp</param-value>
</init-param>
</filter>
<filter-mapping>
<filter-name>RemoteSubmitFilter</filter-name>
<url-pattern>/index-incoming/*</url-pattern>
</filter-mapping>
<context-param>
<param-name>resourceindex.classname</param-name>
<param-value>org.archive.wayback.resourceindex.LocalResourceIndex</param-value>
<description>Class that implements ResourceIndex for this
Wayback</description>
</context-param>
<context-param>
<param-name>resourceindex.sourceclass</param-name>
<param-value>BDB</param-value>
<description>Class that implements ResultSource for this Wayback,
currently: BDB|CDX</description>
</context-param>
<context-param>
<param-name>resourceindex.indexpath</param-name>
<param-value>/wayback/index</param-value>
<description>
LocalBDBResourceIndex specific directory to store the BDB
files.
Directory must exist.
</description>
</context-param>
<context-param>
<param-name>resourceindex.dbname</param-name>
<param-value>DB1</param-value>
<description>
LocalBDBResourceIndex specific name for BDB database
</description>
</context-param>
<context-param>
<param-name>resourceindex.incomingpath</param-name>
<param-value>/wayback/index-data/incoming</param-value>
<description>
BDB index-specific configuration that indicates new CDX
format flat
files will appear in the directory named in the value of
this param.
If this configuration is present and non-empty, a
background thread
will be started that monitors this directory, and adds CDX
records
in files found in this directory to the index.
</description>
</context-param>
<context-param>
<param-name>resourceindex.mergedpath</param-name>
<param-value>/wayback/index-data/merged</param-value>
<description>
If this value is present and non-empty, then CDX files that are
successfully processed from incoming are moved to this
directory
after merging. If this option is missing or blank, CDX
files are
deleted after merging.
</description>
</context-param>
<context-param>
<param-name>resourceindex.failedpath</param-name>
<param-value>/wayback/index-data/failed</param-value>
<description>
If this value is present and non-empty, then CDX files that
fail to
parse successfully are moved to this directory after a single
attempt. If this option is missing or blank, malformed CDX
files are
left in the incoming directory and repeatedly re-attempted
until
some other process moves them out of the way or fixes them.
</description>
</context-param>
<context-param>
<param-name>resourceindex.mergeinterval</param-name>
<param-value>10000</param-value>
<description>
Millisecond interval between checks for new files in the
incoming
directory. This is only the starting number, when no new
files are
found in the directory. Each subsequent interval will
increase by
this number of ms, until a file is found, at which point the
interval will revert to the initial level.
</description>
</context-param>
<context-param>
<param-name>maxresults</param-name>
<param-value>1000</param-value>
<description>
Maximum number of results to return from the ResourceIndex.
</description>
</context-param>
<!-- END OF Local-BDB ResourceIndex OPTIONS -->
</web-app>
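[Editor's note] The point of this thread, that `resourcestore.indextarget` must match `resourceindex.incomingpath`, can be checked mechanically. The sketch below copies the two relevant param lines from the web.xml above into a fixture; in practice one would point the grep at the real web.xml instead.

```shell
# Sanity check: the directory the resourceStore writes CDX files to must
# be the directory the resourceIndex watches. The fixture reproduces the
# two param blocks from the web.xml in this thread.
cat > /tmp/wayback-paths.xml <<'EOF'
<param-name>resourcestore.indextarget</param-name>
<param-value>/wayback/index-data/incoming</param-value>
<param-name>resourceindex.incomingpath</param-name>
<param-value>/wayback/index-data/incoming</param-value>
EOF
extract_param() {
  # Print the <param-value> that follows the named <param-name>.
  grep -A1 "<param-name>$1</param-name>" /tmp/wayback-paths.xml \
    | sed -n 's:.*<param-value>\(.*\)</param-value>.*:\1:p'
}
target=$(extract_param resourcestore.indextarget)
incoming=$(extract_param resourceindex.incomingpath)
if [ "$target" = "$incoming" ]; then
  echo "paths match: $target"
else
  echo "paths differ: store=$target index=$incoming"
fi
```

This assumes each `<param-value>` sits on the line directly after its `<param-name>`, as in the file above.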
Thanks,
Natalia
|
|
From: Mathew B. <Mat...@na...> - 2008-01-06 22:24:57
|
Hi Natalia,

Make sure that the "incoming" directory in your resourceIndex configuration is the same directory as the "target" directory from your resourceStore configuration.

Hope this helps
m.

>>> Natalia Torres <nt...@ce...> 5/01/2008 12:18 a.m. >>>
Hello

I installed wayback 0.8 following the instructions on the web page: placing the .war file in the appropriate location, waiting for Tomcat to unpack the .war file, customizing the base wayback.xml file, and restarting tomcat.

I customized web.xml to use the wayback machine in timeline access mode, setting:

- Local-ARC ResourceStore OPTIONS: resourcestore.autoindex = 1, resourcestore.indexinterval = 1000, and all the paths
- Local-BDB ResourceIndex OPTIONS: resourceindex.mergeinterval = 10000, maxresults = 1000

Three days after the tomcat restart I can't search many of the urls in those arcs. The arc directory contains 6548 arc.gz files; listing the information generated by wayback, the index-data/merged dir has only 14 files and the arc-indexer/work dir has the other 6534 files. The log file has no error messages and is reading arc files.

How much time is needed to index the arcs? Are more changes to the configuration needed?

Thanks
N.

This e-mail is intended for the addressee only and may contain information which is subject to legal privilege. The contents are not necessarily the official view or communication of the National Library of New Zealand. If you are not the intended recipient you must not use, disclose, copy or distribute this e-mail or any information in, or attached to it. If you have received this e-mail in error, please contact the sender immediately or return the original message to the National Library by e-mail, and destroy any copies. The National Library does not accept any liability for changes made to this e-mail or attachments after sending. All e-mails have been scanned for viruses and content by security software. The National Library reserves the right to monitor all e-mail communications through its network. |
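[Editor's note] For the "how long will indexing take" question in this thread, a rough progress probe is to compare the ARC count against the queued flag files and the merged CDX files. The directory names below follow Natalia's configuration but are assumptions about the local layout; missing directories simply count as zero.

```shell
# Rough progress probe for the Wayback 0.8 auto-indexer: count ARCs,
# queued flag files, and merged CDX files. Directories are from the
# configuration discussed in this thread; adjust to the local layout.
arcs=$(ls /dades/arcs/*.arc.gz 2>/dev/null | wc -l | tr -d ' ')
queued=$(ls /wayback/arc-indexer/queued 2>/dev/null | wc -l | tr -d ' ')
merged=$(ls /wayback/index-data/merged 2>/dev/null | wc -l | tr -d ' ')
echo "ARCs: $arcs  queued flags: $queued  merged CDX files: $merged"
```

When the merged count stops growing while queued flags remain, the merge thread, rather than the ARC indexer, is the place to look.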