From: Michael S. <sta...@us...> - 2006-11-28 02:03:08
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler
In directory sc8-pr-cvs11.sourceforge.net:/tmp/cvs-serv18291

Modified Files:
    .classpath maven.xml project.properties project.xml
Log Message:
Changes as a result of tests with the streaming ARCReader, plus fixes so that
nutchwax can use the newer archive-commons. Also added the s3 URL handler here
to heritrix from nutchwax so all URL handlers are sitting beside each other.
It doesn't really belong here but its sojourn should be temporary. We'll move
it out when we make the separate archive-commons project.

* .classpath
* project.properties
* project.xml
    Add s3 jar.
* maven.xml
    Include s3 handler in archive-commons.
* src/java/org/archive/io/ArchiveReader.java
    (getDeleteFileOnCloseReader): Added.
* src/java/org/archive/io/ArchiveReaderFactory.java
    Fix so we return WARC or ARC, not just WARC, when getting an ArchiveReader
    stream.
    Added comment on the result of tests of ArchiveReading over an S3 stream.
    When making an arc local, name the temporary file after the actual arc
    rather than inventing a temporary name.
    Swapped setting of useragent to before opening of the url connection
    (it was throwing an exception when set afterward).
    Was returning an ArchiveReader when wrapping ARCReader or WARCReader so we
    could get control on close and clean up temporary files. Fixed so the
    wrapper is of the appropriate class (WARC or ARC).
* src/java/org/archive/io/RepositionableInputStream.java
    Add call to mark so the underlying BufferedInputStream will save at least
    the last read. (Previously it just discarded the old buffer when it moved
    on to the next, so whenever we wanted to back up, say because
    gzipinputstream had overread, we'd frequently fail.)
* src/java/org/archive/io/arc/ARCReader.java
    (getDeleteFileOnCloseReader): Added implementation.
* src/java/org/archive/io/warc/WARCReader.java
    (getDeleteFileOnCloseReader): Added placeholder.
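The RepositionableInputStream note above relies on standard java.io mark/reset behaviour. The following is a minimal sketch of that behaviour only, not the actual RepositionableInputStream code: the class name MarkResetSketch, its methods, the 4-byte buffer size and the sample data are all made up for illustration. It shows that once mark has been called, a BufferedInputStream retains the bytes read since the mark even across a buffer refill, so a caller can back up after an overread, and that reset is rejected if mark was never called.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Heritrix or archive-commons.
class MarkResetSketch {

    /**
     * Consumes one byte, marks, lets a greedy reader take `overread` bytes,
     * then resets and returns the bytes that can be read again.
     */
    static String overreadThenBackUp(byte[] data, int overread) {
        // Deliberately tiny buffer (4 bytes) so the overread crosses a refill.
        try (BufferedInputStream in =
                new BufferedInputStream(new ByteArrayInputStream(data), 4)) {
            in.read();          // the byte the caller legitimately consumed
            in.mark(overread);  // ask the stream to retain up to `overread` bytes
            for (int i = 0; i < overread; i++) {
                in.read();      // stand-in for a decompressor reading too far
            }
            in.reset();         // back up to the marked position
            byte[] again = new byte[overread];
            int n = in.read(again, 0, overread);
            return new String(again, 0, Math.max(n, 0), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true if reset() is rejected because mark() was never called. */
    static boolean resetFailsWithoutMark(byte[] data) {
        try (BufferedInputStream in =
                new BufferedInputStream(new ByteArrayInputStream(data), 4)) {
            in.read();
            in.reset();         // no mark set, so this throws IOException
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] data = "abcdefgh".getBytes(StandardCharsets.US_ASCII);
        System.out.println(overreadThenBackUp(data, 6));
        System.out.println(resetFailsWithoutMark(data));
    }
}
```

The read limit passed to mark is what matters here: it has to cover the largest overread the wrapped reader might make, otherwise the mark can be invalidated when the buffer refills.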
Index: maven.xml
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/maven.xml,v
retrieving revision 1.79
retrieving revision 1.80
diff -C2 -d -r1.79 -r1.80
*** maven.xml 15 Nov 2006 00:10:34 -0000 1.79
--- maven.xml 28 Nov 2006 02:03:02 -0000 1.80
***************
*** 115,118 ****
--- 115,119 ----
  <ant:arg value="org.archive.net.rsync.Handler.class" />
  <ant:arg value="org.archive.net.md5.Handler.class" />
+ <ant:arg value="org.archive.net.s3.Handler.class" />
  <ant:arg value="org.apache.commons.pool.impl.FairGenericObjectPool" />

Index: project.properties
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/project.properties,v
retrieving revision 1.102
retrieving revision 1.103
diff -C2 -d -r1.102 -r1.103
*** project.properties 17 Oct 2006 19:52:20 -0000 1.102
--- project.properties 28 Nov 2006 02:03:02 -0000 1.103
***************
*** 7,11 ****
  maven.compile.encoding=UTF-8

! # Tell maven that we're JVM 1.4 exclusively.
  maven.compile.source = 1.5
  maven.compile.target = 1.5
--- 7,11 ----
  maven.compile.encoding=UTF-8

! # Tell maven that we're JVM 1.5 exclusively.
  maven.compile.source = 1.5
  maven.compile.target = 1.5
***************
*** 43,46 ****
--- 43,47 ----
  maven.jar.beanshell = ${basedir}/lib/bsh-2.0b4.jar
  maven.jar.jericho-html = ${basedir}/lib/jericho-html-2.3.jar
+ maven.jar.s3 = ${basedir}/lib/s3-20061030.jar

  # Below directives have to do w/ the MANIFEST-MF that gets generated by Maven.

Index: project.xml
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/project.xml,v
retrieving revision 1.153
retrieving revision 1.154
diff -C2 -d -r1.153 -r1.154
*** project.xml 17 Oct 2006 19:52:20 -0000 1.153
--- project.xml 28 Nov 2006 02:03:02 -0000 1.154
***************
*** 701,704 ****
--- 701,725 ----
  </properties>
  </dependency>
+ <dependency>
+ <id>s3</id>
+ <version>1.0.0</version>
+ <url>http://builds.archive.org:8080/cruisecontrol/buildresults/HEAD-heritrix</url>
+ <properties>
+ <war.bundle>true</war.bundle>
+ <description>
+ This jar contains code for accessing S3 AND
+ org.archive.net.s3.Handler.class. The S3 code is a subset of
+ code obtained at URL given above. Here's how I made this jar
+ (after changing some of the statics to have public rather than
+ default access and copying the s3 Handler.class from
+ Nutchwax build dir local):
+ 1131 javac com/amazon/thirdparty/Base64.java com/amazon/s3/Utils.java com/amazon/s3/AWSAuthConnection.java
+
+ 1134 jar -cf s3-20061030.jar `find com org -name '*.class'`
+ TODO: Clean this up (Need to make sure all is working first).
+ </description>
+ <license />
+ </properties>
+ </dependency>
  </dependencies>

Index: .classpath
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/.classpath,v
retrieving revision 1.91
retrieving revision 1.92
diff -C2 -d -r1.91 -r1.92
*** .classpath 17 Oct 2006 19:52:20 -0000 1.91
--- .classpath 28 Nov 2006 02:03:02 -0000 1.92
***************
*** 30,33 ****
--- 30,34 ----
  <classpathentry kind="lib" path="lib/commons-net-1.4.1.jar"/>
  <classpathentry kind="lib" path="lib/jericho-html-2.3.jar"/>
+ <classpathentry kind="lib" path="lib/s3-20061030.jar"/>
  <classpathentry kind="output" path="target"/>
  </classpath>