Heritrix: Internet Archive Web Crawler

Tracker: Bugs

5 Heritrix ignores charset - ID: 896878
Last Update: Comment added ( karl-ia )

The extractor does a
get.getHttpRecorder().getRecordedInput().getCharSequence()
which gives back a char sequence that assumes a
single-byte encoding. Parsing will fail against
multibyte encodings that don't have single-byte ASCII at
the base of the encoding.

Looks like the extractor needs to get the charset from
the HTML HEAD content-type meta tag and then back up
the stream if it gets anything other than the current
JVM encoding and ask for a char sequence of the found
encoding (Need to add support for this to
ReplayCharSequence).


Submitted: Michael Stack ( stack-sf ) - 2004-02-14 01:38
Priority: 5
Status: Closed
Resolution: Fixed
Assigned: Michael Stack
Category: Disk I/O
Group: None
Visibility: Public


Comments ( 6 )

Date: 2007-03-14 00:07
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-71 -- please add further
comments at that location.


Date: 2004-03-10 19:58
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Committed an implementation that will return a
character-based ReplayCharSequence if the HTTP header
Content-Type charset is judged multibyte; otherwise it
returns the old byte-oriented ReplayCharSequence. Closing
this issue as fixed. I made a new RFE, "[ 913687 ] Make
extractors interrogate for charset", to cover the
outstanding need for making extractors interrogate for
charset (and for the case where charset is not supplied,
there is "[ 899909 ] Add charset detection to processing").



Date: 2004-02-27 20:09
Sender: stack-sf (Project Admin)

Talked some w/ gordon about the ReplayCharSequence
implementation. Of note:

+ Usually there is only one extractor in the processing
stream for any one mimetype.
+ Usually the file will fit into the buffer; we won't have to
resort to the backing file.

Considering the above, for a first cut at a multibyte-aware
ReplayCharSequence, we'd test if the file fits in the buffer.
If it does, just decode into a new CharBuffer and use this
for charAt'ing through. If the file doesn't fit in the
buffer, use the buffer and backing file to write a new UTF-16
file and do the charAt'ing out of the new file using a
memory-mapped file channel (makes implementation easy).
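
A rough sketch of that first cut, assuming plain Java NIO. The names DecodedReplay, fromBuffer and fromFile are hypothetical and are not Heritrix classes. The sketch simplifies the "buffer plus backing file" case to a single recorded file, and malformed input is replaced rather than reported:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class DecodedReplay {
    /** Case 1: the content fits in the buffer, so decode it in memory. */
    static CharSequence fromBuffer(byte[] buffer, int length, Charset cs) {
        // Charset.decode substitutes a replacement char for malformed input.
        return cs.decode(ByteBuffer.wrap(buffer, 0, length));
    }

    /** Case 2: transcode the recorded file to UTF-16BE, then memory-map the result. */
    static CharSequence fromFile(Path recorded, Charset cs, Path utf16Cache)
            throws IOException {
        try (Reader in = new BufferedReader(
                 new InputStreamReader(Files.newInputStream(recorded), cs));
             Writer out = new BufferedWriter(
                 new OutputStreamWriter(Files.newOutputStream(utf16Cache),
                                        StandardCharsets.UTF_16BE))) {
            char[] chunk = new char[8192];
            for (int n; (n = in.read(chunk)) != -1; ) {
                out.write(chunk, 0, n);
            }
        }
        try (FileChannel ch = FileChannel.open(utf16Cache, StandardOpenOption.READ)) {
            // ByteBuffer defaults to big-endian, so asCharBuffer() reads the
            // UTF-16BE code units directly; charAt(i) is a fixed-offset lookup.
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asCharBuffer();
        }
    }
}
```

Both methods hand back a CharBuffer, which is a CharSequence, so the result can go straight into a regex Matcher.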

Implementation v2 would add optimizations that keep the
decoded CharBuffer/cache file around across extractor
invocations w/i the processing of a single URI
(HttpRecorder#cleanup would be extended to release objects
and files we're done w/).

Also of note, if memory-mapped file channels make the
implementation easier -- even if they are ten times slower
-- we might go w/ this technique because rare is the case
where the file is larger than the buffer.


Date: 2004-02-27 18:42
Sender: stack-sf (Project Admin)

Changed issue summary from 'Extractor ignores charset' to
'Heritrix ignores charset'. Issue has expanded. Below is a
description of the scope, an outline of the work and an estimate.

Heritrix has fetchers get content. The fetchers save the
fetched content off in file-backed buffers. Then each
extractor in the processing chain asks to have the content
played back, often as a CharSequence so it can regex through
the content. Other times, such as the writing of ARC files,
extractors need to treat the stream as raw, unadulterated
bytes.

Currently when extractors want a CharSequence, the
assumption is that the stream is always made up of
single-byte characters. Regexes going against a multi- or
variable-byte stream regarded as single-byte will give odd
results (e.g. a regex might find a 0x3c and treat it as an
angle bracket, '<', when in fact it is byte 2 of 0x4e3c,
the CJK ideograph for a 'bowl of food' in Unicode).
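
The failure mode above can be reproduced in a few lines. This is only an illustration: SingleByteMisread and asSingleByte are made-up names, and ISO-8859-1 stands in for "treat every byte as one char":

```java
import java.nio.charset.StandardCharsets;

final class SingleByteMisread {
    /** Decodes one char per byte, as a single-byte-only CharSequence would. */
    static String asSingleByte(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+4E3C encoded as UTF-16BE is the two bytes 0x4E 0x3C.
        byte[] raw = "\u4e3c".getBytes(StandardCharsets.UTF_16BE);

        String wrong = asSingleByte(raw);
        String right = new String(raw, StandardCharsets.UTF_16BE);

        System.out.println(wrong.indexOf('<')); // 1: a phantom angle bracket
        System.out.println(right.indexOf('<')); // -1: no bracket in the real text
    }
}
```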

So, we need to make it so extractors can get back a
CharSequence that will happily skip through streams of
multibyte characters.

There are various locations where charset is specified: As
a suffix to the HTTP Content-Type header; in the HTML HEAD
Content-Type META tag, in the first line of an XML document,
etc. We can divide the locations into two types: Locations
the fetcher knows of and locations the extractor knows of.
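
As a sketch of reading the charset out of two of those locations, assuming simple regexes are good enough. CharsetSniffer and its methods are hypothetical names, and the META pattern only handles http-equiv appearing before content:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetSniffer {
    // Matches charset=... in a Content-Type value, quoted or not.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
        "charset\\s*=\\s*[\"']?([^\\s;\"'>]+)", Pattern.CASE_INSENSITIVE);

    // Matches <meta http-equiv="Content-Type" content="..."> in that attribute order.
    private static final Pattern META_CONTENT_TYPE = Pattern.compile(
        "<meta\\s+http-equiv\\s*=\\s*[\"']?content-type[\"']?"
            + "\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    /** Fetcher-side location: the HTTP Content-Type header value. */
    static String fromContentType(String value) {
        if (value == null) return null;
        Matcher m = CHARSET_PARAM.matcher(value);
        return m.find() ? m.group(1) : null;
    }

    /** Extractor-side location: the HTML HEAD Content-Type META tag. */
    static String fromHtmlHead(CharSequence head) {
        Matcher m = META_CONTENT_TYPE.matcher(head);
        return m.find() ? fromContentType(m.group(1)) : null;
    }
}
```

For example, fromContentType("text/html; charset=Shift_JIS") gives back "Shift_JIS". Finding the META tag this way only works when the markup itself is readable in the encoding the CharSequence was built with, which is the single-byte ASCII base the issue description mentions.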

I propose Heritrix does charset handling in the following fashion:

+ Add setCharacterEncoding/getCharacterEncoding methods to
HttpRecorder, RecordingInputStream and
RecordingOutputStream. Calls to getCharSequence will use the
last passed character encoding setting when manufacturing a
CharSequence object to return.
+ Fetchers look at fetcher locations for charset encoding
and if available, set it into the HttpRecorder or
Recording*Stream.
+ Subsequently, extractors may change or improve upon the
charset encoding. Extractors know how to interrogate the
document type they were written against. Extractors should
be changed so the first thing they do is figure out the
charset encoding. For example, the ExtractorHTTP should
first try to find the HEAD META Content-Type field and see if
it has a charset. If it does, and if it's different from the
charset of the CharSequence currently being processed,
ExtractorHTTP should set the newly found encoding into the
Recording*Stream, discard the current CharSequence and
restart the processing w/ a newly obtained CharSequence.
+ ReplayCharSequence is what is returned when extractors ask
for a CharSequence. Proposal is that ReplayCharSequence now
takes a charset encoding. In the constructor it examines
the charset. If the charset is empty, or throws an
UnknownEncodingException, or is a member of a set of known
aliases for single-byte charsets, we'll return the old
implementation. Otherwise, we'll return a new CharSequence
implementation, one cognizant of multibyte handling. The
probable implementation will take the Recording*Stream
buffer and backing file, decode them using the passed
charset encoding and cache the result to a file written in
the JVM's native encoding -- UTF-16BE, an encoding w/ chars
of regular byte size -- so we save on encoding transforms
every time it's read in. When extractors ask for a
CharSequence, they'll be given back a CharSequence that goes
against this file cache. We write the cache because there'll
usually be more than just one extractor wanting the multibyte
sequence and decoding is CPU intensive/expensive.
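
One way the "old implementation or new one" decision in the last item might look. This is a sketch under assumptions: instead of the proposed set of known single-byte aliases, it asks the JVM's encoder for maxBytesPerChar, and CharsetClassifier and needsMultibyteSequence are hypothetical names:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

final class CharsetClassifier {
    /**
     * True when a multibyte-aware CharSequence is called for; false means
     * fall back to the old single-byte implementation.
     */
    static boolean needsMultibyteSequence(String charsetName) {
        if (charsetName == null || charsetName.trim().isEmpty()) {
            return false; // no charset supplied
        }
        final Charset cs;
        try {
            cs = Charset.forName(charsetName.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return false; // name is malformed or unknown to this JVM
        }
        if (!cs.canEncode()) {
            return true; // decode-only charset: assume multibyte to be safe
        }
        // Single-byte charsets such as ISO-8859-1 report exactly 1.0 here.
        return cs.newEncoder().maxBytesPerChar() > 1.0f;
    }
}
```

With this check, "ISO-8859-1" and "US-ASCII" keep the old byte-oriented path while "UTF-8" and "UTF-16BE" get the new one. An explicit alias set, as proposed, would avoid depending on which charsets the running JVM ships with.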

The above should improve our character encoding handling
significantly, though it has to be said there will always
be inscrutable content (e.g. the recent case where we got a
UTF-16 404 page which did not specify in the HTTP
Content-Type that the page was UTF-16. In this case
extractors will treat the 2nd byte of the Chinese ideogram
'bowl of food' as a '<' and there is nought we can do about
it).

I'd like to take the time to profile the difference between
using memory-mapped channels vs. input streams for writing
the cache. The knowledge gained will help inform future
Heritrix decisions about filesystem I/O (as opposed to
socket channel I/O, for which it seems this research has
already been done; was it written up anywhere?).

I estimate 3/5/7 days to do the above work.


Date: 2004-02-19 20:44
Sender: stack-sf (Project Admin)

Some notes from the httpclient site:

"You can set the content type header for a request with the
addRequestHeader method in each method and retrieve the
encoding for the response body with the getResponseCharSet
method."


"If the response is known to be a String, you can use the
getResponseBodyAsString method which will automatically use
the encoding specified in the Content-Type header or
ISO-8859-1 if no charset is specified."


From
http://jakarta.apache.org/commons/httpclient/charencodings.html



Date: 2004-02-19 17:43
Sender: stack-sf (Project Admin)

Some notes from the httpclient site:

"You can set the content type header for a request with the
addRequestHeader method in each method and retrieve the
encoding for the response body with the getResponseCharSet
method."


"If the response is known to be a String, you can use the
getResponseBodyAsString method which will automatically use
the encoding specified in the Content-Type header or
ISO-8859-1 if no charset is specified."


From
http://jakarta.apache.org/commons/httpclient/charencodings.html



Attached File

No Files Currently Attached

Changes ( 5 )

Field          Old Value                      Date              By
close_date     -                              2004-03-10 19:58  stack-sf
resolution_id  None                           2004-03-10 19:58  stack-sf
status_id      Open                           2004-03-10 19:58  stack-sf
summary        ExtractorHTTP ignores charset  2004-02-27 18:42  stack-sf
assigned_to    nobody                         2004-02-17 22:32  gojomo