From the list:
stack writes:
>> What about letting the default be the JVM's file encoding (System
>> property file.encoding)? Wouldn't this be EUC_JP/Shift_JIS on Japanese
>> machines?

Yes, it probably would. But not everyone crawling foreign-language
content is on a machine with the same language, so this would break
down in that case. Much better (IMHO) would be to let the user choose
explicitly for each job.

    -tree

-- 
Tom Emerson, Software Architect, Basis Technology Corp.
http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"

Makes sense to me.

The metatags are not being read by extractors currently, as you guess
below. There is an RFE to address this.

As Heritrix currently works, if there is no charset, or if there is a
problem getting the charset (say it's a charset name the JVM doesn't
recognize, such as x-sjis), then we default to single-byte. It would
make sense for it to be possible to specify the default to use,
especially if the default is multibyte.

What about letting the default be the JVM's file encoding (System
property file.encoding)? Wouldn't this be EUC_JP/Shift_JIS on Japanese
machines?

St.Ack
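
To make the fallback order under discussion concrete, here is a minimal
sketch. It is not Heritrix code: the class name CharsetFallback, the
method resolve, and the jobDefault parameter are all hypothetical, and
it assumes the order "declared charset, then a per-job default, then
the JVM default" that the two messages above converge on.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Heritrix.
final class CharsetFallback {

    /**
     * Picks a charset for decoding a fetched page.
     * Order assumed here: declared charset, then the per-job default,
     * then the JVM default (derived from file.encoding at startup).
     */
    static Charset resolve(String declared, String jobDefault) {
        Charset cs = lookup(declared);
        if (cs == null) {
            cs = lookup(jobDefault);
        }
        return cs != null ? cs : Charset.defaultCharset();
    }

    /** Returns the named charset, or null if the name is missing or unknown to this JVM. */
    private static Charset lookup(String name) {
        if (name == null || name.trim().isEmpty()) {
            return null;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            return null;
        }
    }
}
```

With this shape, a job crawling Japanese content from a non-Japanese
machine could pass its own default explicitly, and the JVM setting only
matters when nothing else is given.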
Tom Emerson wrote:
>>I've been thinking about this some more (don't tell my boss!), and
>>I think I understand what may be going on based on a reading of the
>>code. Consider this scenario:
>>
>>1) You are crawling a Japanese(*) site.
>>
>>2) The pages are returned with "Content-Type: text/html". Note that
>>   no charset is specified in the HTTP response header.
>>
>>3) The pages display just fine in a Japanese(*) browser, but that is
>>   to be expected because it has probably been configured to use a
>>   Japanese encoding by default.
>>
>>(*) Japanese is used here, but the same is true for Chinese or Korean
>>or pretty much any other multibyte language.
>>
>>Here's the problem:
>>
>>The Content-Type does *not* specify the encoding (charset) of the
>>content. This means that the call to getResponseCharSet() will return
>>null, and isMultibyteEncoding in ReplayCharSequenceFactory will return
>>false. At this point no transcoding is done from the document
>>character set into Unicode (the JVM's character set) and it is
>>possible that the Extractors will get confused, as described in the
>>Wiki. Consider:
>>
>>(0) tree% wget -S -O test.html http://jajatom.moo.jp/Top.html
>>--18:14:36-- http://jajatom.moo.jp/Top.html
>> => `test.html'
>>Resolving jajatom.moo.jp... done.
>>Connecting to jajatom.moo.jp[202.3.141.232]:80... connected.
>>HTTP request sent, awaiting response...
>> 1 HTTP/1.1 200 OK
>> 2 Date: Mon, 12 Jul 2004 22:14:37 GMT
>> 3 Server: Apache/1.3.29 (Unix) PHP/4.2.4-dev
>> 4 Last-Modified: Sun, 27 Jun 2004 04:38:00 GMT
>> 5 ETag: "116fb4-3498-40de4f28"
>> 6 Accept-Ranges: bytes
>> 7 Content-Length: 13464
>> 8 Keep-Alive: timeout=1, max=100
>> 9 Connection: Keep-Alive
>>10 Content-Type: text/html
>>
>> 0K .......... ... 100% 17.44 KB/s
>>
>>18:14:37 (19.34 KB/s) - `test.html' saved [13464/13464]
>>
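
For illustration, a minimal sketch of pulling the charset parameter out
of a Content-Type value like the one in the response above. The helper
ContentTypeCharset.charsetOf is hypothetical and is not the
getResponseCharSet() implementation; it assumes a simple
"type/subtype; name=value" layout and does not handle every legal
quoting form.

```java
import java.util.Locale;

// Hypothetical helper, not the HTTP client's own parser.
final class ContentTypeCharset {

    /** Returns the charset parameter of a Content-Type value, or null if there is none. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip one pair of surrounding double quotes, if present.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

For the bare "text/html" header shown in the wget output this returns
null, which is the case where a default has to take over.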
>>The file has an http-equiv metatag in its header, however, that
>>specifies the charset:
>>
>><META Http-Equiv="Content-Type" Content="text/html; charset=x-sjis">
>>
>>It does not look like the metatag is being processed, so the charset
>>is never being set.
>>
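
One way the metatag could be picked up, sketched under assumptions:
scan only the first few kilobytes of the body, decode them as
ISO-8859-1 (which maps every byte to a character, so the ASCII markup
survives whatever the real encoding is), and look for a charset
declaration inside a META tag. MetaCharsetSniffer and sniff are
hypothetical names, and the regex is a rough approximation rather than
a real HTML parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix.
final class MetaCharsetSniffer {

    // Matches "charset=<name>" anywhere inside a <meta ...> tag, case-insensitively.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a META tag within the first maxBytes, or null. */
    static String sniff(byte[] body, int maxBytes) {
        int n = Math.min(body.length, maxBytes);
        // ISO-8859-1 never fails to decode, so ASCII tags are readable in any encoding.
        String head = new String(body, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

On the tag quoted above this yields the string x-sjis, which still has
to be mapped to something the JVM recognizes.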
>>In this particular case even if the HTTP response header had included
>>the charset, the ReplayCharSequenceFactory would not identify it as
>>multibyte because 'x-sjis' is not one of the recognized character
>>sets.
>>
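
A sketch of one alternative to a fixed list of recognized names. This
is a different technique from what ReplayCharSequenceFactory is
described as doing: it first maps nonstandard labels through a small
alias table, then asks the JVM's encoder how many bytes a character can
take. MultibyteCheck, normalize, isMultibyte, and the alias table
contents are all hypothetical, and whether Shift_JIS or EUC-JP is
available depends on the JVM installation.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Heritrix.
final class MultibyteCheck {

    // Illustrative alias table for labels seen in the wild; extend as needed.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("x-sjis", "Shift_JIS");
        ALIASES.put("x-euc-jp", "EUC-JP");
    }

    /** Maps a known nonstandard label to a canonical name; other labels pass through trimmed. */
    static String normalize(String label) {
        String trimmed = label.trim();
        return ALIASES.getOrDefault(trimmed.toLowerCase(Locale.ROOT), trimmed);
    }

    /** True if the JVM knows the charset and it can use more than one byte per character. */
    static boolean isMultibyte(String label) {
        if (label == null) {
            return false;
        }
        try {
            Charset cs = Charset.forName(normalize(label));
            return cs.canEncode() && cs.newEncoder().maxBytesPerChar() > 1.0f;
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat as not multibyte.
            return false;
        }
    }
}
```

Asking the encoder avoids keeping a hand-maintained list in step with
the JVM, at the cost of depending on which charsets the JVM ships.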
>>I can envision adding a parameter to the crawler that allows you to
>>define the "default" character set used for pages that don't specify
>>one in their HTTP Content-Type response header. Lynx has an option
>>for this called -assume-charset.
>>
>>Ideally the ReplayCharSequenceFactory would look for the http-equiv
>>and make its decision based on the data there. Even then it would be
>>useful to be able to specify the 'default'.
>>
>>Does any of this make sense?
>>
>> -tree
>>
>>
>>
Michael Stack
Configuration
None
Public
| Field | Old Value | Date | By |
|---|---|---|---|
| close_date | - | 2004-07-28 20:19 | stack-sf |
| status_id | Open | 2004-07-28 20:18 | stack-sf |