Bjarne reported a problem on the list...
In ARCWriter#generateARCFileMetaData it does this after
writing the metadata:
    // Write out a couple of LINE_SEPARATORs to end this record.
    metabaos.write(("" + LINE_SEPARATOR + LINE_SEPARATOR).
        getBytes(DEFAULT_ENCODING));
These two LINE_SEPARATORs are not counted in the record length.
The length should be fixed. (It was probably done like
this in an attempt at mimicking how the Alexa tools
write ARCs.) I've made an issue. Thanks for pointing
it out, Bjarne.
St.Ack
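
A minimal sketch of the point above, not the actual Heritrix code or the fix that was eventually applied: if the declared length is taken from the buffer only after the trailing separators have been written, those bytes are included in the count. The class name RecordLengthSketch and the method buildRecordBody are hypothetical, and the sketch assumes LINE_SEPARATOR is a single '\n' character and that the encoding is UTF-8; neither is confirmed by the message.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; names other than LINE_SEPARATOR and metabaos
// do not come from ARCWriter.
class RecordLengthSketch {
    // Assumption: the separator is a single newline character.
    static final char LINE_SEPARATOR = '\n';

    // Builds the record body: metadata followed by two separators.
    static byte[] buildRecordBody(String metadata) {
        ByteArrayOutputStream metabaos = new ByteArrayOutputStream();
        // Assumption: UTF-8 stands in for DEFAULT_ENCODING.
        byte[] meta = metadata.getBytes(StandardCharsets.UTF_8);
        metabaos.write(meta, 0, meta.length);
        byte[] trailer = ("" + LINE_SEPARATOR + LINE_SEPARATOR)
                .getBytes(StandardCharsets.UTF_8);
        metabaos.write(trailer, 0, trailer.length);
        return metabaos.toByteArray();
    }

    public static void main(String[] args) {
        String metadata = "<arcmetadata>example</arcmetadata>";
        byte[] body = buildRecordBody(metadata);
        int metadataOnly = metadata.getBytes(StandardCharsets.UTF_8).length;
        // Measuring before the separators are written leaves them out.
        System.out.println("metadata only: " + metadataOnly);
        // Measuring the finished buffer includes them.
        System.out.println("with trailing separators: " + body.length);
    }
}
```

Whether the trailing separators belong in the declared length is a question of the ARC format convention; the sketch only shows where the count can drift from the bytes actually written.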
Bjarne Andersen wrote:
> It seems to me that the filedesc:// URL-record in the generated
> ARC-files has an error.
> There are 2 newlines after the content, which causes the length of the
> record to be short by one.
> Here is an example:
>
> filedesc://IAH-20041108114937-00000-asterix.arc 0.0.0.0 20041108114937 text/plain 1217
> 1 1 InternetArchive
> URL IP-address Archive-date Content-type Archive-length
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <arcmetadata xmlns="http://archive.org/arc/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:arc="http://archive.org/arc/1.0/" xsi:schemaLocation="http://archive.org/arc/1.0/ http://www.archive.org/arc/1.0/arc.xsd">
> <arc:software>Heritrix 1.0.4 http://crawler.archive.org</arc:software>
> <arc:hostname>asterix</arc:hostname>
> <arc:ip>130.226.231.5</arc:ip>
> <dcterms:isPartOf>Simple</dcterms:isPartOf>
> <dc:description>Profile: Simple crawl</dc:description>
> <arc:operator>Admin</arc:operator>
> <dc:date xsi:type="dcterms:W3CDTF">2004-11-02T08:27:41+00:00</dc:date>
> <arc:http-header-user-agent>Mozilla/5.0 (compatible; heritrix/1.0.4 +http://www.netarkivet.dk/website/info.html)</arc:http-header-user-agent>
> <arc:http-header-from>bja@statsbiblioteket.dk</arc:http-header-from>
> <arc:robots>classic</arc:robots>
> <dc:format>ARC file version 1.1</dc:format>
> <dcterms:conformsTo xsi:type="dcterms:URI">http://www.archive.org/web/researcher/ArcFileFormat.php</dcterms:conformsTo>
> </arcmetadata>
>
>
> dns:www.netarkivet.dk 130.226.220.16 20041108114937 text/dns 61
>
> The length says 1217 bytes, but it seems to me that the length is
> really 1218 bytes?
>
> best
> Bjarne Andersen
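
The check described in the report can be sketched as follows: read the declared length from the last field of the record header line and compare it with the number of body bytes actually present. This is an illustration only; the class DeclaredLengthCheck and its methods are hypothetical, the header is assumed to be space-separated with the length as its final field (as in the example above), and the 1218-byte body is a placeholder matching the size Bjarne suspects, not the real record.

```java
// Hypothetical helper for comparing a declared ARC record length
// with the size of the body that was actually read.
class DeclaredLengthCheck {
    // Assumption: the length is the last whitespace-separated header field.
    static long declaredLength(String headerLine) {
        String[] fields = headerLine.trim().split("\\s+");
        return Long.parseLong(fields[fields.length - 1]);
    }

    // Returns actual minus declared; a nonzero result signals a mismatch.
    static long lengthError(String headerLine, byte[] body) {
        return body.length - declaredLength(headerLine);
    }

    public static void main(String[] args) {
        String header = "filedesc://IAH-20041108114937-00000-asterix.arc "
                + "0.0.0.0 20041108114937 text/plain 1217";
        // Placeholder body of the size suspected in the report.
        byte[] body = new byte[1218];
        System.out.println("declared: " + declaredLength(header));
        System.out.println("actual - declared: " + lengthError(header, body));
    }
}
```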
Michael Stack
Disk I/O
None
Public
Date: 2007-03-14 00:18
Date: 2005-02-08 01:07 Logged In: YES
| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2005-02-08 01:07 | stack-sf |
| resolution_id | None | 2005-02-08 01:07 | stack-sf |
| close_date | - | 2005-02-08 01:07 | stack-sf |