Heritrix: Internet Archive Web Crawler

Tracker: Bugs

5 First arc record length is off by one - ID: 1062621
Last Update: Comment added ( karl-ia )

Bjarne reported a problem on the list...

In ARCWriter#generateARCFileMetaData it does this after
writing the metadata:

// Write out a couple of LINE_SEPARATORs to end this record.
metabaos.write(("" + LINE_SEPARATOR + LINE_SEPARATOR).
    getBytes(DEFAULT_ENCODING));

These two LINE_SEPARATORs are not counted in the record length.
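
To illustrate the kind of miscount described above, here is a minimal sketch. It is not the Heritrix code: the class name `ArcMetaLengthSketch`, the method `lengths`, and the single-byte `LINE_SEPARATOR` value are assumptions made for illustration. It only shows that a length measured before trailing separators are written will not include them.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from ARCWriter.
class ArcMetaLengthSketch {
    // Assumption: the separator is a single "\n" byte.
    static final String LINE_SEPARATOR = "\n";

    // Buffers the metadata followed by `trailing` separators and returns
    // { length measured before the separators, bytes actually buffered }.
    static int[] lengths(String metadata, int trailing) {
        ByteArrayOutputStream metabaos = new ByteArrayOutputStream();
        byte[] body = metadata.getBytes(StandardCharsets.UTF_8);
        metabaos.write(body, 0, body.length);

        // Measuring here misses everything written afterwards.
        int countedEarly = metabaos.size();

        byte[] sep = LINE_SEPARATOR.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < trailing; i++) {
            metabaos.write(sep, 0, sep.length);
        }
        return new int[] { countedEarly, metabaos.size() };
    }

    public static void main(String[] args) {
        int[] r = lengths("<arcmetadata/>", 2);
        System.out.println("counted early: " + r[0]
            + ", actually written: " + r[1]);
    }
}
```

Taking the size of the buffer only after the last write is one way to keep the declared length and the written bytes in agreement.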

The length should be fixed. (It was probably done like
this in an attempt at mimicking how the Alexa tools
write ARCs.) I've made an issue. Thanks for pointing
it out, Bjarne.
St.Ack


Bjarne Andersen wrote:

> It seems to me that the filedesc:// URL-record in the generated
> ARC-files has an error.
> There are 2 newlines after the content, which causes the length of the
> record to be short by one.
> Here is an example:
>
> filedesc://IAH-20041108114937-00000-asterix.arc 0.0.0.0 20041108114937 text/plain 1217
> 1 1 InternetArchive
> URL IP-address Archive-date Content-type Archive-length
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <arcmetadata xmlns="http://archive.org/arc/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:arc="http://archive.org/arc/1.0/" xsi:schemaLocation="http://archive.org/arc/1.0/ http://www.archive.org/arc/1.0/arc.xsd">
> <arc:software>Heritrix 1.0.4 http://crawler.archive.org</arc:software>
> <arc:hostname>asterix</arc:hostname>
> <arc:ip>130.226.231.5</arc:ip>
> <dcterms:isPartOf>Simple</dcterms:isPartOf>
> <dc:description>Profile: Simple crawl</dc:description>
> <arc:operator>Admin</arc:operator>
> <dc:date xsi:type="dcterms:W3CDTF">2004-11-02T08:27:41+00:00</dc:date>
> <arc:http-header-user-agent>Mozilla/5.0 (compatible; heritrix/1.0.4 +http://www.netarkivet.dk/website/info.html)</arc:http-header-user-agent>
> <arc:http-header-from>bja@statsbiblioteket.dk</arc:http-header-from>
> <arc:robots>classic</arc:robots>
> <dc:format>ARC file version 1.1</dc:format>
> <dcterms:conformsTo xsi:type="dcterms:URI">http://www.archive.org/web/researcher/ArcFileFormat.php</dcterms:conformsTo>
> </arcmetadata>
>
>
> dns:www.netarkivet.dk 130.226.220.16 20041108114937 text/dns 61
>
> The length says 1217 bytes - but it seems to me that the length is
> really 1218 bytes?
>
> best
> Bjarne Andersen
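
The comparison in the report above (declared 1217 bytes against an observed 1218) can be sketched as a small check. This is a hedged illustration, not a Heritrix or Alexa tool: the class `ArcLengthCheckSketch` and its methods are hypothetical, and it assumes the Archive-length value is the last space-separated field of the header line, as in the quoted example. The body is a placeholder array sized from the report, not the real record content.

```java
// Hypothetical illustration only; names are not from Heritrix.
class ArcLengthCheckSketch {
    // Returns the Archive-length value, assumed to be the last
    // whitespace-separated field of the record header line.
    static long declaredLength(String headerLine) {
        String[] fields = headerLine.trim().split("\\s+");
        return Long.parseLong(fields[fields.length - 1]);
    }

    // Positive result: the body holds more bytes than the header declares.
    static long surplusBytes(String headerLine, byte[] body) {
        return body.length - declaredLength(headerLine);
    }

    public static void main(String[] args) {
        String header = "filedesc://IAH-20041108114937-00000-asterix.arc"
            + " 0.0.0.0 20041108114937 text/plain 1217";
        // Placeholder body whose size matches the 1218 bytes reported.
        byte[] body = new byte[1218];
        System.out.println("surplus bytes: " + surplusBytes(header, body));
    }
}
```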


Michael Stack ( stack-sf ) - 2004-11-08 19:33

5

Closed

Fixed

Michael Stack

Disk I/O

None

Public


Comments ( 2 )

Date: 2007-03-14 00:18
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-288 -- please add further
comments at that location.


Date: 2005-02-08 01:07
Sender: stack-sf (Project Admin)

I removed the extra LINE_SEPARATOR. After testing the new-format
ARCs with av_procarc and ARCReader, I committed the change. Closing.


Changes ( 3 )

Field          Old Value  Date              By
status_id      Open       2005-02-08 01:07  stack-sf
resolution_id  None       2005-02-08 01:07  stack-sf
close_date     -          2005-02-08 01:07  stack-sf