Heritrix: Internet Archive Web Crawler

Tracker: Bugs

5 Link puts garbage into arc file: http://www.msn.com/robots.t - ID: 874057
Last Update: Comment added ( karl-ia )

Crawling this link, http://www.msn.com/robots.txt,
generates junk in the arc file. It's probably because
the page is UTF-16
(<meta http-equiv="Content-Type" content="text/html;
charset=UTF-16">). I'd guess we're not respecting the
page encoding and are mangling it when we write to disk.


Submitted: Michael Stack ( stack-sf ) - 2004-01-09 21:16
Priority: 5
Status: Closed
Resolution: Invalid
Assigned: Michael Stack
Category: Protocols
Group: 0.2.0
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 00:06
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-49 -- please add further
comments at that location.


Date: 2004-02-18 21:41
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Yeah, raw UTF-16 looks like garbage when viewed in anything
that doesn't respect encodings. I checked the arc content and
it's the unmangled file, so that's good; it makes this issue
invalid. Closing.

The file that prompted this issue is interesting: it is an
HTML page returned w/ a 404 code when we ask for robots.txt.
It has links in it that we won't find, because the page is
UTF-16 and so inscrutable to us w/o specifying a char stream
encoding when we read.

Because the page is UTF-16, we can't read the encoding from
the HTML HEAD META Content-Type tag: we'd have to know the
encoding already in order to decode the tag. We have to rely
on the HTTP headers. In this case, they don't give the
encoding/charset; only the mimetype.
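
To illustrate the header case described above, here is a minimal sketch of pulling a charset out of a Content-Type header value. The class and method names (ContentTypeCharset, charsetOf) are hypothetical and are not taken from Heritrix; the parsing is deliberately simple and assumes a well-formed header.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not Heritrix code: extracts the charset parameter
// from a Content-Type header value, if one is present.
final class ContentTypeCharset {
    static Optional<String> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // parts[0] is the mimetype; any parameters follow after ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Drop optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
        }
        // Only a mimetype was given, as in the response discussed here.
        return Optional.empty();
    }
}
```

With a header of just "text/html" this returns an empty Optional, which is the situation described above: nothing in the response tells the reader which charset to use.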

So the page remains inscrutable unless we introduce charset
detection. Making a feature request...
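
As a rough idea of what charset detection could look like for the UTF-16 case only, the sketch below checks for a byte-order mark and then falls back to a zero-byte heuristic. This is an assumption-laden illustration, not the approach Heritrix adopted: the class name Utf16Sniffer, the minimum of 4 byte pairs, and the 90% threshold are all made up for the example, and the heuristic only works for mostly-ASCII markup.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical sketch, not Heritrix code: guesses whether the first bytes
// of a response body are UTF-16, and in which byte order.
final class Utf16Sniffer {
    static Optional<Charset> sniff(byte[] head) {
        // 1. A byte-order mark, if present, settles the question.
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return Optional.of(StandardCharsets.UTF_16BE);
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return Optional.of(StandardCharsets.UTF_16LE);
            }
        }
        // 2. No BOM: mostly-ASCII text in UTF-16 has a zero byte in every
        //    other position. Count zeros at even and odd offsets.
        int usable = head.length - (head.length % 2);
        int pairs = usable / 2;
        int evenZeros = 0;
        int oddZeros = 0;
        for (int i = 0; i < usable; i += 2) {
            if (head[i] == 0) evenZeros++;
            if (head[i + 1] == 0) oddZeros++;
        }
        // Thresholds are arbitrary assumptions for this sketch.
        if (pairs >= 4) {
            if (oddZeros * 10 >= pairs * 9 && evenZeros == 0) {
                return Optional.of(StandardCharsets.UTF_16LE);
            }
            if (evenZeros * 10 >= pairs * 9 && oddZeros == 0) {
                return Optional.of(StandardCharsets.UTF_16BE);
            }
        }
        return Optional.empty();
    }
}
```

A caller would pass the first few hundred bytes of the body and, when a charset comes back, use it to decode the page before link extraction. Note that when a BOM was found, the decoded text will begin with U+FEFF unless the caller skips the first two bytes.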


Date: 2004-02-03 19:18
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Good find.

I knew we were not respecting encodings properly for
analyzing/extraction (ReplayCharSequence) -- but the
ReplayInputStream should be byte-for-byte the same as the
initial read. Are you sure we're mangling the contents?
Raw UTF-16 would look like garbage compared to
other encodings...
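
The distinction drawn above, intact bytes versus a misleading view of them, can be shown with a small standalone sketch. It does not use ReplayInputStream or ReplayCharSequence; the class name Utf16GarbageDemo and the sample text are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo, not Heritrix code: the same bytes viewed with the
// wrong and the right charset.
final class Utf16GarbageDemo {
    // Decodes raw bytes the way a viewer assuming the given charset would.
    static String view(byte[] raw, Charset assumed) {
        return new String(raw, assumed);
    }

    public static void main(String[] args) {
        String original = "User-agent: *";
        // Stand-in for the bytes as fetched and written to disk.
        byte[] raw = original.getBytes(StandardCharsets.UTF_16LE);

        // An encoding-unaware viewer sees a NUL after every character.
        String wrong = view(raw, StandardCharsets.ISO_8859_1);
        // A viewer that respects the encoding sees the original text.
        String right = view(raw, StandardCharsets.UTF_16LE);

        System.out.println(wrong.equals(original));
        System.out.println(right.equals(original));
        // The stored bytes were never altered; only the view differed.
        System.out.println(Arrays.equals(raw, wrong.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The point of the sketch is that a byte-for-byte copy can be correct even though it looks mangled in a tool that assumes a single-byte encoding.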


Attached File

No Files Currently Attached

Changes ( 4 )

Field          Old Value  Date              By
status_id      Open       2004-02-18 21:41  stack-sf
resolution_id  None       2004-02-18 21:41  stack-sf
close_date     -          2004-02-18 21:41  stack-sf
assigned_to    nobody     2004-02-03 19:18  gojomo