Heritrix: Internet Archive Web Crawler

Tracker: Bugs

8 'empty' records in compressed arc files - ID: 856555
Last Update: Comment added ( karl-ia )

The ARCWriter module occasionally writes empty records to
arc files when the 'compress' option is set to true.
Compressed arc files created by this module sometimes
contain empty gzip members. Each empty record consists of
a gzip header (31-139-8-0-0-0-0-0-0-0) followed by an
empty compressed stream and trailer
(3-0-0-0-0-0-0-0-0-0). The occurrence of these records in
an arc file shows no pattern (URL, MIME type, response
code, or length) related to the preceding or following
records in the same arc file.
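
The byte pattern above can be reproduced and searched for with a short script. This is a minimal sketch in Python, not part of Heritrix: gzip_member and find_empty_members are hypothetical helper names, and the sketch assumes a compressed arc file is a plain concatenation of gzip members whose headers carry no optional fields.

```python
import struct
import zlib
from typing import List

# The 10-byte header quoted in the report: magic (31, 139), method 8 (deflate),
# no flags, zero mtime, XFL 0, OS 0.
GZIP_HEADER = bytes([31, 139, 8, 0, 0, 0, 0, 0, 0, 0])
# Empty deflate block (3, 0) followed by CRC32 = 0 and ISIZE = 0.
EMPTY_BODY = bytes([3, 0, 0, 0, 0, 0, 0, 0, 0, 0])


def gzip_member(payload: bytes) -> bytes:
    """Build one gzip member around payload using a raw deflate stream."""
    deflater = zlib.compressobj(wbits=-15)  # -15 selects raw deflate, no wrapper
    body = deflater.compress(payload) + deflater.flush()
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return GZIP_HEADER + body + trailer


def find_empty_members(data: bytes) -> List[int]:
    """Return the byte offsets of gzip members that decompress to nothing."""
    offsets = []
    pos = 0
    while pos < len(data):
        inflater = zlib.decompressobj(wbits=31)  # 31 selects gzip framing
        payload = inflater.decompress(data[pos:])
        if not inflater.eof:
            break  # truncated final member: stop scanning
        if not payload:
            offsets.append(pos)
        # unused_data holds whatever follows the member just decoded.
        pos = len(data) - len(inflater.unused_data)
    return offsets
```

Running find_empty_members over the bytes of a compressed arc file would list the offsets of any empty records, which could then be compared against the neighbouring records.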


Submitted: Igor Ranitovic ( ia_igor ) - 2003-12-08 23:49
Priority: 8
Status: Closed
Resolution: None
Assigned: Igor Ranitovic
Category: General
Group: None
Visibility: Public


Comments ( 6 )

Date: 2007-03-14 00:06
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-43 -- please add further
comments at that location.


Date: 2004-02-12 18:46
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Really closing.


Date: 2004-02-10 17:47
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

No recurrence since fix was submitted (according to mighty
Igor). Closing. Will reopen if we come across it again.


Date: 2004-02-10 17:41
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Really closing.


Date: 2004-02-10 17:38
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

No recurrence since fix was submitted (according to mighty
Igor). Closing. Will reopen if we come across it again.


Date: 2003-12-11 00:25
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

I've committed a probable fix.

ARCWriter was triggering the GZIP header and footer even for
certain cases where there was nothing to write. (Not sure
exactly when these were, as such CrawlURIs would properly
have a fetchStatus of 0 or negative, causing the entire
ARCWriter routine to exit early.)

In any case, I moved the code to emit the header and
close-up footer closer to the writing of an actual record --
so that only when a record is written should the GZIP
pseudo-open/close occur.

After our next significant Heritrix crawl with this new
code, please check to see if any of the ARCs exhibit this
problem.
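
The change described in this comment can be illustrated with a small sketch. It is written in Python rather than taken from the Heritrix source, and write_record_eager and write_record_lazy are hypothetical names. The assumption is that the bug amounted to opening and closing a gzip member before checking whether there was anything to write.

```python
import gzip
import io


def write_record_eager(out, record: bytes) -> None:
    """Buggy ordering: the gzip header and footer are emitted even for no data."""
    with gzip.GzipFile(fileobj=out, mode="wb", mtime=0) as gz:
        if record:
            gz.write(record)


def write_record_lazy(out, record: bytes) -> None:
    """Fixed ordering: open the gzip member only when a record will be written."""
    if not record:
        return  # nothing to write, so emit no header or footer at all
    with gzip.GzipFile(fileobj=out, mode="wb", mtime=0) as gz:
        gz.write(record)


if __name__ == "__main__":
    eager, lazy = io.BytesIO(), io.BytesIO()
    for rec in (b"record one\n", b"", b"record two\n"):
        write_record_eager(eager, rec)
        write_record_lazy(lazy, rec)
    # Both streams decompress to the same content, but the eager stream
    # carries an extra empty gzip member for the empty record.
    print(len(eager.getvalue()) - len(lazy.getvalue()))
```

With the lazy ordering, an empty record leaves the output untouched, so no empty gzip member can appear between real records.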



Attached File

No Files Currently Attached

Changes ( 7 )

Field        Old Value         Date              By
close_date   2004-02-10 17:41  2004-02-12 18:46  stack-sf
status_id    Open              2004-02-12 18:46  stack-sf
status_id    Closed            2004-02-10 17:47  stack-sf
close_date   -                 2004-02-10 17:41  stack-sf
status_id    Open              2004-02-10 17:41  stack-sf
assigned_to  gojomo            2003-12-11 00:25  gojomo
assigned_to  nobody            2003-12-10 20:03  gojomo