
Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 [arcreader] Fetch records and iterate remote ARCs - ID: 1387423
Last Update: Comment added ( karl-ia )

Add a feature to iterate over, or fetch a single record from, a
remote ARC -- i.e. an ARC at the end of a URL --
without needing to copy the whole ARC locally first.
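
One reason a single record can be fetched without the whole file: a compressed ARC (`.arc.gz`) is normally written as a series of independent gzip members, one per record, so decompression can start at any record boundary. The following is a minimal in-memory sketch of that idea, not Heritrix code. The class name `GzipMemberSketch` and its helper methods are hypothetical, and the "records" are plain strings rather than real ARC records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not part of org.archive.io.
public class GzipMemberSketch {

    /** Compresses one record as its own gzip member and appends it to out. */
    static void appendMember(ByteArrayOutputStream out, String record) throws IOException {
        GZIPOutputStream gz = new GZIPOutputStream(out);
        gz.write(record.getBytes(StandardCharsets.UTF_8));
        gz.finish(); // writes the gzip trailer; 'out' stays open for the next member
    }

    /** Decompresses starting at the given byte offset, through to the end of the data. */
    static String readFrom(byte[] data, int offset) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(data, offset, data.length - offset);
        try (GZIPInputStream gz = new GZIPInputStream(in)) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        appendMember(file, "record-one\n");
        int secondOffset = file.size(); // byte offset where the second member starts
        appendMember(file, "record-two\n");

        // Only the bytes from secondOffset onward are needed to recover record two.
        System.out.print(readFrom(file.toByteArray(), secondOffset));
    }
}
```

The sketch reads the last member because Java's `GZIPInputStream` keeps decoding any members that follow the one it started on. A reader that wants exactly one record from the middle of a file would have to stop at the member boundary itself.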


Submitted: Michael Stack ( stack-sf ) - 2005-12-21 19:09
Priority: 7
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: scripts
Group: 1.10.0
Visibility: Public


Comments ( 6 )

Date: 2007-03-14 01:45
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-989 -- please add further
comments at that location.


Date: 2006-08-21 19:14
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Actually close.


Date: 2006-08-19 00:21
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Added a user-agent. Here is the resulting line from an Apache access log:

127.0.0.1 - - [18/Aug/2006:17:12:31 -0700] "GET /test.arc.gz HTTP/1.1" 206 14858 "-" "org.archive.io.arc.ARCReader" "-"

Thanks for testing, Karl.

Closing. Commit below:

Finish '[ 1387423 ] [arcreader] Fetch records and iterate
remote ARCs'
* src/java/org/archive/io/WriterPoolMember.java
* src/java/org/archive/io/arc/ARCReader.java
Javadoc edit.
* src/java/org/archive/io/arc/ARCReaderFactory.java
Add a user agent.
* src/java/org/archive/io/arc/ARCReaderFactoryTest.java
Minor change to commented-out test (the test is commented out
because it requires the net, but it is useful for testing ARC
reading over HTTP).
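
The access log entry in this comment shows status 206 (Partial Content) and the new user-agent string, which suggests the reader sends a `Range` header along with `User-Agent`. A minimal sketch of preparing such a request with the standard `java.net.HttpURLConnection` follows. This is not the actual ARCReaderFactory code: the class `RangeRequestSketch`, the method `prepare`, and the localhost URL are hypothetical, and only the user-agent value is taken from the log line.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical illustration only; not the ARCReaderFactory implementation.
public class RangeRequestSketch {

    // Value copied from the Apache access log line quoted in this comment.
    static final String USER_AGENT = "org.archive.io.arc.ARCReader";

    /** Prepares (but does not send) a GET for all bytes from offset onward. */
    static HttpURLConnection prepare(String url, long offset) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setRequestProperty("Range", "bytes=" + offset + "-");
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Assumed test URL; nothing goes over the network until the response is read.
        HttpURLConnection conn = prepare("http://localhost/test.arc.gz", 0);
        System.out.println(conn.getRequestProperty("User-Agent"));
        System.out.println(conn.getRequestProperty("Range"));
        // A server that honours the range should answer 206 Partial Content:
        // int status = conn.getResponseCode();
    }
}
```

Setting the headers before any response is read keeps the sketch usable offline. Uncommenting the `getResponseCode()` call sends the request against a real server.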



Date: 2006-08-18 23:38
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Add a user-agent if possible, then close.


Date: 2006-04-12 22:51
Sender: karl-ia

Logged In: YES
user_id=1269624

This works just fine, but doesn't supply a User-Agent to the
HTTP server it's getting the remote ARC from. Something
like "ARCreader 1.8.x (archive.org)" might be appropriate here.


Date: 2005-12-21 19:22
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Finished up. It is now possible to pass a URL to ARCReader and have
it iterate over records across a network stream.

Assigning Karl.

TO TEST:

+ Verify ARCReader works as it used to (no need to spend
too much time on this because lots of unit tests verify
ARCReader operation -- but maybe I broke the command-line
interface somewhat?).
+ Test using ARCReader to read remote ARCs specified with an
HTTP URL.


Finish up on '[ 1387423 ] [arcreader] Fetch records and iterate
remote ARCs'
* src/java/org/archive/io/RepositionableInputStream.java
Add a constructor so the buffer size can be set. Added javadoc on
experience using it. Fix our bypassing of the underlying
BufferedInputStream buffer by calling its read rather than reading
against the passed stream directly. Add overrides to ensure position
always gets updated on read.
(position): Cleaned up math.
* src/java/org/archive/io/RepositionableInputStreamTest.java
More thorough, awkward testing of RepositionableInputStream.
* src/java/org/archive/io/arc/ARCReader.java
(alignedOnFirstRecord): Added. Flag is false when the backing stream
did not start at the zeroth position (e.g. the ARCReader was made
against an offset into a stream).
* src/java/org/archive/io/arc/ARCReaderFactory.java
Moved the gets around so that gets that take a File are grouped
together. Same for those that take a URL. Added an argument to the
compressed ARC reader constructor: alignedOnFirstRecord. I removed
the get override for URL that allowed specifying an offset. Wait
till someone needs it. Meantime, it's clear which get pulls ARCs
local: the one that doesn't take an offset. The one that does take
an offset tries to read the ARCs remotely (even if the offset is
zero; note it's now possible to iterate over complete ARC contents
over a network stream).
* src/java/org/archive/io/arc/ARCReaderFactoryTest.java
Changed the commented-out test used for testing reading ARCs over
the net.
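
The RepositionableInputStream notes above describe going through a BufferedInputStream for every read and keeping a position counter up to date. A rough sketch of that pattern follows, assuming a subclass of `java.io.BufferedInputStream`. The class `PositionTrackingStream` is a hypothetical stand-in and makes no claim to match the real class's API or internals.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in; not the real org.archive.io.RepositionableInputStream.
public class PositionTrackingStream extends BufferedInputStream {
    private long position = 0;       // bytes consumed so far
    private long markedPosition = 0; // position remembered at the last mark()

    public PositionTrackingStream(InputStream in, int bufferSize) {
        super(in, bufferSize);
    }

    public long position() {
        return position;
    }

    @Override
    public synchronized int read() throws IOException {
        int b = super.read();
        if (b != -1) position++;
        return b;
    }

    @Override
    public synchronized int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) position += n;
        return n;
    }

    @Override
    public synchronized long skip(long n) throws IOException {
        long skipped = super.skip(n);
        position += skipped;
        return skipped;
    }

    @Override
    public synchronized void mark(int readLimit) {
        super.mark(readLimit);
        markedPosition = position;
    }

    @Override
    public synchronized void reset() throws IOException {
        super.reset();
        position = markedPosition; // rewind the counter along with the buffer
    }

    public static void main(String[] args) throws IOException {
        PositionTrackingStream s = new PositionTrackingStream(
                new ByteArrayInputStream("0123456789".getBytes()), 4);
        s.read();                  // consume one byte
        s.mark(8);                 // remember this spot
        s.read(new byte[3], 0, 3); // read a little further
        System.out.println("after read:  " + s.position());
        s.reset();                 // back to the marked spot
        System.out.println("after reset: " + s.position());
    }
}
```

Because every read path goes through the overridden methods, the counter stays in step with the buffer, which is what lets a caller report the offset of the record it is currently reading.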


Attached File

No Files Currently Attached

Changes ( 5 )

Field              Old Value  Date              By
close_date         -          2006-08-21 19:15  stack-sf
status_id          Open       2006-08-21 19:15  stack-sf
artifact_group_id  1.8.0      2006-08-18 23:38  gojomo
assigned_to        karl-ia    2006-04-12 22:51  karl-ia
assigned_to        stack-sf   2005-12-21 19:22  stack-sf