Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

5 Specification of default CharSequence charset - ID: 989816
Last Update: Comment added ( karl-ia )

From the list:

stack writes:

>> What about letting the default be the JVM's file encoding (System
>> property file.encoding)? Wouldn't this be EUC_JP/Shift_JIS on Japanese
>> machines?


Yes, it probably would. But not everyone crawling foreign language
content is on a machine with the same language, so this would break
down in that case. Much better (IMHO) would be to let the user choose
explicitly for each job.

-tree

--
Tom Emerson, Basis Technology Corp.
Software Architect, http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"


Makes sense to me.

The metatags are not being read by extractors currently, as you guess
below. There is an RFE to address this.

As Heritrix currently works, if there is no charset, or if there is a
problem getting the charset (say it's a charset name the JVM doesn't
recognize, such as x-sjis), then we default to single-byte. It would
make sense for it to be possible to specify the default to use,
especially if the default is multibyte.

What about letting the default be the JVM's file encoding (System
property file.encoding)? Wouldn't this be EUC_JP/Shift_JIS on Japanese
machines?

St.Ack
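
To illustrate the fallback discussed above, here is a minimal Java sketch. It is not the Heritrix code: the class and method names are hypothetical, and it only shows one way to fall back to a caller-supplied default when the charset name is missing or the JVM does not recognize it.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, for illustration only (not part of Heritrix).
public class CharsetFallback {

    /**
     * Returns the charset with the given name, or the fallback when the
     * name is null, blank, syntactically illegal, or unknown to this JVM.
     */
    static Charset resolve(String name, Charset fallback) {
        if (name == null || name.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Assumed per-job default; a Japanese crawl might choose a
        // Japanese encoding here instead.
        Charset jobDefault = StandardCharsets.ISO_8859_1;
        System.out.println(resolve(null, jobDefault));
        System.out.println(resolve("UTF-8", jobDefault));
        System.out.println(resolve("no-such-charset", jobDefault));
    }
}
```

Whether a name such as x-sjis resolves depends on the JVM version and its installed charsets, so the sketch treats an unrecognized name as a normal case rather than an error.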


Tom Emerson wrote:


>>I've been thinking about this some more (don't tell my boss!), and
>>I think I understand what may be going on based on a reading of the
>>code. Consider this scenario:
>>
>>1) You are crawling a Japanese(*) site.
>>
>>2) The pages are returned with "Content-Type: text/html". Note that
>>   no charset is specified in the HTTP response header.
>>
>>3) The pages display just fine in a Japanese(*) browser, but that is
>>   to be expected because it has probably been configured to use a
>>   Japanese encoding by default.
>>
>>(*) Japanese is used here, but the same is true for Chinese or Korean
>>or pretty much any other multibyte language.
>>
>>Here's the problem:
>>
>>The Content-Type does *not* specify the encoding (charset) of the
>>content. This means that the call to getResponseCharSet() will return
>>null, and isMultibyteEncoding in ReplayCharSequenceFactory will return
>>false. At this point no transcoding is done from the document
>>character set into Unicode (the JVM's character set) and it is
>>possible that the Extractors will get confused, as described in the
>>Wiki. Consider:
>>
>>(0) tree% wget -S -O test.html http://jajatom.moo.jp/Top.html
>>--18:14:36-- http://jajatom.moo.jp/Top.html
>> => `test.html'
>>Resolving jajatom.moo.jp... done.
>>Connecting to jajatom.moo.jp[202.3.141.232]:80... connected.
>>HTTP request sent, awaiting response...
>> 1 HTTP/1.1 200 OK
>> 2 Date: Mon, 12 Jul 2004 22:14:37 GMT
>> 3 Server: Apache/1.3.29 (Unix) PHP/4.2.4-dev
>> 4 Last-Modified: Sun, 27 Jun 2004 04:38:00 GMT
>> 5 ETag: "116fb4-3498-40de4f28"
>> 6 Accept-Ranges: bytes
>> 7 Content-Length: 13464
>> 8 Keep-Alive: timeout=1, max=100
>> 9 Connection: Keep-Alive
>>10 Content-Type: text/html
>>
>> 0K .......... ... 100% 17.44 KB/s
>>
>>18:14:37 (19.34 KB/s) - `test.html' saved [13464/13464]
>>
>>The file has an http-equiv metatag in its header, however, that
>>specifies the charset:
>>
>><META Http-Equiv="Content-Type" Content="text/html; charset=x-sjis">
>>
>>It does not look like the metatag is being processed, so the charset
>>is never being set.
>>
>>In this particular case even if the HTTP response header had included
>>the charset, the ReplayCharSequenceFactory would not identify it as
>>multibyte because 'x-sjis' is not one of the recognized character
>>sets.
>>
>>I can envision adding a parameter to the crawler that allows you to
>>define the "default" character set used for pages that don't specify
>>one in their HTTP Content-Type response header. Lynx has an option
>>for this called -assume-charset.
>>
>>Ideally the ReplayCharSequenceFactory would look for the http-equiv
>>and make its decision based on the data there. Even then it would be
>>useful to be able to specify the 'default'.
>>
>>Does any of this make sense?
>>
>> -tree
>>
>>
>>
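
The lookup order proposed in the message above (HTTP header first, then the http-equiv metatag, then a configured default) could be sketched roughly as follows. This is an illustration only, not how ReplayCharSequenceFactory works: the class name, method names, and regular expressions are assumptions, and the metatag pattern is deliberately simple (it expects http-equiv to come before content).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of Heritrix).
public class CharsetSniffer {

    // Finds charset=value (optionally quoted) in a Content-Type value.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
        "charset\\s*=\\s*[\"']?([^\\s;\"'>]+)", Pattern.CASE_INSENSITIVE);

    // Finds <meta http-equiv="Content-Type" content="..."> and captures
    // the content attribute. Simplification: attribute order is fixed.
    private static final Pattern META_CONTENT_TYPE = Pattern.compile(
        "<meta\\s+http-equiv\\s*=\\s*[\"']?content-type[\"']?"
            + "\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    /** Charset named in a Content-Type header value, if any. */
    static Optional<String> fromContentType(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(headerValue);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Charset named in an http-equiv metatag near the top of the page. */
    static Optional<String> fromMetaTag(String htmlPrefix) {
        if (htmlPrefix == null) {
            return Optional.empty();
        }
        Matcher m = META_CONTENT_TYPE.matcher(htmlPrefix);
        return m.find() ? fromContentType(m.group(1)) : Optional.empty();
    }

    /** Header wins, then the metatag, then the assumed default. */
    static String detect(String headerValue, String htmlPrefix, String assumed) {
        Optional<String> fromHeader = fromContentType(headerValue);
        if (fromHeader.isPresent()) {
            return fromHeader.get();
        }
        return fromMetaTag(htmlPrefix).orElse(assumed);
    }

    public static void main(String[] args) {
        String meta = "<META Http-Equiv=\"Content-Type\" "
            + "Content=\"text/html; charset=x-sjis\">";
        System.out.println(detect("text/html", meta, "ISO-8859-1"));
    }
}
```

The name returned is only a label; it still has to be mapped to a charset the JVM supports, which is where an unrecognized name such as x-sjis would need an alias or a fallback.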





Michael Stack ( stack-sf ) - 2004-07-12 23:21

5

Closed

None

Michael Stack

Configuration

None

Public


Comments ( 9 )

Date: 2007-03-14 01:32
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-801 -- please add further
comments at that location.


Date: 2004-08-06 21:19
Sender: nobody

Logged In: NO

To aredcomet:

I cannot download the gzipped arc.

I get this when I download:

-rw-r--r-- 1 stack stack 404K Aug 6 14:10 IAH-20040628073815-00000-localhost.localdomain.arc.gz

But gzip reports it corrupted. Is it corrupt on your end ('gzip -t')?



Date: 2004-08-04 05:35
Sender: aredcomet

Logged In: YES
user_id=1081820

To stack-sf :

I have put an example ARC file here.

http://ared.k2.xrea.com/

offset 221177 is SHIFT_JIS
offset 410564 is EUC_JP

http://ared.k2.xrea.com/file/IAH-20040628073815-00000-localhost.localdomain.arc.gz


Date: 2004-08-03 16:11
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

To aredcomet:

Do you have examples?


Date: 2004-08-03 15:57
Sender: aredcomet

Logged In: YES
user_id=1081820

Hello.
Thank you for the response.

Character transformation is carried out by the ARC file.

EUC_JP site is OK!
Shift_JIS site is NOT GOOD.



Date: 2004-07-28 20:19
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Really closing. (Was fixed a few weeks ago.)


Date: 2004-07-19 21:53
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Reviewed and tested patch. Fixed long lines. Added
specification of default charset to option help. Committed.
Below was commit message. Closing.


Fix for [ 989816 ] Specification of default CharSequence charset
Fix submitted by Tom Emerson, tree at lists dot sourceforge dot net
Here is his comment on the patch:

'...adds support for specifying the "default" charset used when crawling
pages that do not specify one in the HTTP Content-Type response header.

...

NOTE: the default value for the setting is the default charset used by
the Commons HTTP client, which is ISO-8859-1. Previous discussion
centered on using the file.encoding system property, but I don't think
this is a good idea. The code is structured such that changing this
decision will be easily done.

Also, I noticed that option settings for the proxy hosts were slightly
off: the code looked like they were supposed to be marked as expert
settings, but the calls to addElementToDefinitions() did not store the
returned element. I fixed this.'

* src/java/org/archive/crawler/fetcher/FetchHTTP.java
Fix to make proxy setting expert setting (Was setting 'expert' on
the option setup just previous because wasn't catching returned
option pointer).
(ATTR_DEFAULT_ENCODING, setCharacterEncoding): Added.



Date: 2004-07-19 17:40
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Tom Emerson wrote:

>John Erik Halse writes:
>
>>+1 for both!
>>I suggest that Michael's approach should be used if no encoding is
>>explicitly set by the user.
>
>
>I've been thinking about this some more. The System property
>file.encoding may not be what we want, without adding more logic to
>the mix. For example, on an English Solaris 2.8 system the property is
>ISO646-US, which is a subset of ISO 8859-1, the de facto default for
>HTTP. On a RedHat 7.3 system the POSIX locale specifies
>"ANSI_X3.4-1968"...
>
>This may be OK, or it may not be. For now I'd be most comfortable
>making the 'default' value 'ISO-8859-1'.
>
> -tree
>
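
For anyone who wants to see what their own machine reports, the short Java sketch below prints the file.encoding property next to the JVM's default charset. The class name is hypothetical, and the values depend entirely on the platform, locale, and JVM version, which is the point made in the message above.

```java
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
public class ShowDefaultEncoding {

    /** Describes the encoding settings this JVM was started with. */
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
            + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Output varies by machine; no fixed value should be assumed.
        System.out.println(describe());
    }
}
```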



Date: 2004-07-13 13:55
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

John Erik Halse wrote:

>+1 for both!
>I suggest that Michael's approach should be used if no encoding is
>explicitly set by the user.
>
>- John
>



Changes ( 2 )

Field       Old Value  Date              By
close_date  -          2004-07-28 20:19  stack-sf
status_id   Open       2004-07-28 20:18  stack-sf