Share

Heritrix: Internet Archive Web Crawler

Tracker: Bugs

5 GzippedInputStream class is not thread-safe - ID: 1218019
Last Update: Comment added ( karl-ia )

Creating one GzippedInputStream per file and reading
from them the first line of each in parallel (using one
Thread per file) throws "Corrupt GZIP trailer"
IOExceptions for each Thread at arbitrary positions.

The problem is the way how GzippedInputStream skips
over remaining record content. It uses a static byte[]
buffer where each thread puts in a certain number of
bytes until it's finished.

The attached patch now makes GZIPInputStream use
InflaterInputStream's own skip() method for that.

Christian


Christian Kohlschütter ( ck-heritrix ) - 2005-06-10 08:17

5

Closed

Fixed

Nobody/Anonymous

Disk I/O

1.6.0

Public


Comments ( 4 )

Date: 2007-03-14 00:54
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-430 -- please add further
comments at that location.


Date: 2005-06-10 19:53
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

(duh!) On my bike, had the thought that if anything, this
patch should improve throughput (The skip at least is not
doing copy to passed in read buffer)

Committed.

Nice patch Christian.

Below is commit message.

Fix for '[ 1218019 ] GzippedInputStream class is not
thread-safe' from
Christian Kohlshuetter.

Here is comment from Christian:

" Creating one GzippedInputStream per file and reading
from them the first line of each in parallel (using one
Thread per file) throws "Corrupt GZIP trailer"
IOExceptions for each Thread at arbitrary positions.

The problem is the way how GzippedInputStream skips
over remaining record content. It uses a static byte[]
buffer where each thread puts in a certain number of
bytes until it's finished.

The attached patch now makes GZIPInputStream use
InflaterInputStream's own skip() method for that."

* src/java/org/archive/io/GzippedInputStream.java
Use inflator#skip instead of inflator#read skipping over
content
we're not interested in.


.


Date: 2005-06-10 19:26
Sender: ck-heritrix

Logged In: YES
user_id=1220421

Though I haven't profiled the method, I guess it has no negative
impact; at least in the regular scenario, where the content of each
member has already been read. gotoEOR() then only has to read
the few remaining trailing bytes (if any). For this, 512 bytes are
enough.

However, this buffer is an J2SE-implementation-specific one. If I
look at Sun J2SE 5.0's InflaterInputStream, I do not really
understand why skip() requires a fixed 512-byte buffer, whereas
fill() uses a different one, whose size can be specified via the
constructor.

Christian



Date: 2005-06-10 17:06
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

I like this change Christian. Any noticeable degradation in
performance? (My experience w/ inflater is that it uses
small chunks -- 500 bytes or so -- w/ not obvious way of
influencing size of buffers). If not, I'll commit.
Otherwise, might make sense to change the big SKIP_BUFFER to
be not static?


Attached File ( 1 )

Filename Description Download
GzippedInputStream.patch GzippedInputStream patch Download

Changes ( 5 )

Field Old Value Date By
artifact_group_id None 2005-09-23 18:27 gojomo
status_id Open 2005-06-10 19:53 stack-sf
resolution_id None 2005-06-10 19:53 stack-sf
close_date - 2005-06-10 19:53 stack-sf
File Added 137870: GzippedInputStream.patch 2005-06-10 08:17 ck-heritrix