From: Søren V. C. <sv...@kb...> - 2013-02-01 11:31:57
|
Hi all.

I have installed wayback 1.7.1-SNAPSHOT, which I built myself directly from the pom.xml after downloading the code from https://github.com/internetarchive/wayback

I'm using the locationDBResourceStore that the CDXCollection.xml uses, and it can find the correct files from the CDX. However, it fails to extract the record: it somehow assumes that all files are gzipped, and when a file is not, it fails with the following log entries:

Jan 31, 2013 6:49:18 PM org.archive.wayback.resourcestore.resourcefile.ResourceFactory getResource
INFO: Fetching: /home/prod/wayback/arcs/83807-92-0000-1.arc : 39136770
Jan 31, 2013 6:49:18 PM org.archive.wayback.resourcestore.resourcefile.ResourceFactory getResource
WARNING: ResourceNotAvailable for /home/prod/wayback/arcs/83807-92-0000-1.arc Not in GZIP format
Jan 31, 2013 6:49:18 PM org.archive.wayback.resourcestore.LocationDBResourceStore retrieveResource
INFO: Unable to retrieve /home/prod/wayback/arcs/83807-92-0000-1.arc - java.util.zip.ZipException: Not in GZIP format
Jan 31, 2013 6:49:18 PM org.archive.wayback.webapp.AccessPoint handleReplay
WARNING: (1)LOADFAIL: /home/prod/wayback/arcs/83807-92-0000-1.arc - java.util.zip.ZipException: Not in GZIP format /20100107153228/http://www2.kb.dk/elib/mss/skatte/aeldre_danske/ln185.htm

Can anyone help me here?

/Søren
---------------------------------------------------------------------------
Søren Vejrup Carlsen, Department of Digital Preservation, Royal Library, Copenhagen, Denmark
tel: (+45) 33 47 48 41
email: sv...@kb...
---------------------------------------------------------------------------
Non omnia possumus omnes
--- Macrobius, Saturnalia, VI, 1, 35 ------- |
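For readers following the thread: the "Not in GZIP format" text in the log comes from the JDK's java.util.zip.GZIPInputStream, which raises a ZipException when the stream does not start with the GZIP magic number. The snippet below is only a minimal sketch of that behaviour using the standard library. The class name NotGzipDemo, the helper tryOpenAsGzip and the sample bytes are made up for illustration and are not part of wayback or Heritrix.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipException;

// Hypothetical demo class, not part of wayback or Heritrix.
class NotGzipDemo {

    /**
     * Tries to open the given bytes as a GZIP stream.
     * Returns null if the GZIP header was accepted, otherwise the
     * message of the ZipException thrown by the JDK.
     */
    static String tryOpenAsGzip(byte[] data) throws IOException {
        // The GZIPInputStream constructor reads and checks the header.
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return null;
        } catch (ZipException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up stand-in for the first bytes of an uncompressed ARC file.
        byte[] plain = "filedesc://example.arc 0.0.0.0 20100107153228 text/plain 76\n"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println("GZIP open result: " + tryOpenAsGzip(plain));
    }
}
```

Any code path that wraps an uncompressed .arc file in a GZIPInputStream without checking first would be expected to fail in this way, which matches the LOADFAIL entries above.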
From: Søren V. C. <sv...@kb...> - 2013-02-05 13:54:05
|
Hi all.

I have found the problem. It is in the wayback-core module, in the method org.archive.wayback.resourcestore.resourcefile.ResourceFactory.getResource(File file, long offset).

The method call "ARCReaderFactory.get(path.getName(), is, false);" assumes that the file is a gzipped ARC file, even though getResource should work for both compressed and uncompressed ARC files.

The solution is to replace this call with ARCReaderFactory.get(file, offset). This makes the method work for both compressed and uncompressed ARC files.

/Søren V. Carlsen (Royal Library, Copenhagen)

From: Søren Vejrup Carlsen [mailto:sv...@kb...]
Sent: 1 February 2013 12:32
To: arc...@li...
Subject: [Archive-access-discuss] Workaround for locationDBResourceStore bug in 1.7.1-SNAPSHOT |
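To illustrate the idea behind the workaround (decide per file whether it is compressed instead of assuming GZIP), here is a minimal, library-independent sketch that peeks at the two GZIP magic bytes (0x1f, 0x8b) before choosing how to read a stream. It does not show how ARCReaderFactory.get(file, offset) is implemented. The class GzipSniffer and its methods are hypothetical names used only for this example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of wayback or Heritrix.
class GzipSniffer {

    /** Returns true if the stream starts with the GZIP magic bytes 0x1f 0x8b. */
    static boolean looksGzipped(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(2);            // remember the position so the peek can be undone
        int b1 = in.read();
        int b2 = in.read();
        in.reset();            // rewind: the caller still sees the full stream
        return b1 == 0x1f && b2 == 0x8b;
    }

    /** Wraps the stream in a GZIPInputStream only when it actually is gzipped. */
    static InputStream openMaybeGzipped(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        return looksGzipped(in) ? new GZIPInputStream(in) : in;
    }

    /** Test helper: gzip-compresses a byte array in memory. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(data);
        }
        return bos.toByteArray();
    }

    /** Test helper: reads a stream to the end. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }
}
```

With a check like this, openMaybeGzipped returns the same decoded bytes whether the input is a plain or a gzipped copy of the data. Note that a record read from a non-zero offset inside an ARC file would need the check applied at that offset, which this sketch does not cover.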
From: Noah L. <nl...@ar...> - 2013-02-06 02:51:18
|
Hello Søren,

I committed a fix to ARCReaderFactory in Heritrix for the issue you raised. See https://webarchive.jira.com/browse/HER-2032

I'm not sure how long that will take to appear in a wayback build.

Noah |