I wrote the code shown below, but the parser read the page as Shift_JIS rather than as windows-31j (MS932).
As a result, some Japanese characters such as U+FF0D are garbled.
Parser parser = new Parser("http://kyushu.yomiuri.co.jp/entame/ramen/04ra/ramen040108.htm");
parser.setEncoding("windows-31j");
parser.visitAllNodesWith(visitor);
How can I make the parser use the windows-31j decoder?
There could be two things going wrong.
1) The HTTP server is sending the Shift-JIS encoding directive in the HTTP response header. This can be worked around by saving the URL to disk before parsing with the code you have.
2) The META tag in the header specifies Shift-JIS. This appears to be the case (I see this tag:
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
in the header). This can't be fixed, except by editing the file and fixing the erroneous tag, since the parser will switch encoding at that point.
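As a rough illustration of the "edit the file and fix the tag" workaround, here is a minimal sketch that uses only the JDK. The class name MetaCharsetFixer and the method decodeAndRelabel are made up for this example and are not part of htmlparser. It assumes the raw bytes really are windows-31j and that the JDK ships that charset.

```java
import java.nio.charset.Charset;
import java.util.regex.Pattern;

// Hypothetical helper, not part of htmlparser.
final class MetaCharsetFixer {
    // Matches "charset=Shift_JIS" or "charset=Shift-JIS", ignoring case.
    private static final Pattern DECLARED =
        Pattern.compile("(?i)(charset\\s*=\\s*)shift[_-]jis");

    /**
     * Decodes the raw page bytes as windows-31j and rewrites the
     * declared charset so that it matches the real encoding.
     */
    static String decodeAndRelabel(byte[] raw) {
        String html = new String(raw, Charset.forName("windows-31j"));
        return DECLARED.matcher(html).replaceAll("$1windows-31j");
    }
}
```

The returned text could then be written back to disk, encoded as windows-31j, before parsing. Whether the parser accepts the relabelled tag depends on the charset names its JVM supports.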
Thank you, Derrick.
I saved the page to a local file and parsed it, but got the same result, because of cause 2) that you described.
As you said, the response header and META tag declare "Shift_JIS", which contradicts the meaning of that name as defined by IANA.
However, few browsers can interpret "windows-31j", so page authors (including practically everyone in Japan) have no choice but to declare their HTML as Shift_JIS.
Please see this bug report about the problem:
http://bugs.sun.com/bugdatabase/view_bug.do;:WuuT?bug_id=4556882
Sun really dislikes "de facto" standards, but.....
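To make the difference concrete, here is a small sketch that decodes the same two bytes with both charsets. The class name EncodingDifference is made up for this example. It assumes a JDK that ships both the Shift_JIS and windows-31j charsets, with Shift_JIS following the JIS X 0208 mapping.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of htmlparser.
final class EncodingDifference {
    /** Decodes the bytes with the named charset and returns the first code point. */
    static int firstCodePoint(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName)).codePointAt(0);
    }

    public static void main(String[] args) {
        // The double-byte sequence 0x81 0x7C is the hyphen discussed in this thread.
        byte[] hyphen = {(byte) 0x81, (byte) 0x7C};
        System.out.printf("Shift_JIS   -> U+%04X%n", firstCodePoint(hyphen, "Shift_JIS"));
        System.out.printf("windows-31j -> U+%04X%n", firstCodePoint(hyphen, "windows-31j"));
    }
}
```

Under those assumptions the windows-31j decoder yields U+FF0D (FULLWIDTH HYPHEN-MINUS) while the Shift_JIS decoder yields U+2212 (MINUS SIGN), which is why the declared charset matters here.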
Anyway, I think it would be very helpful if setEncoding() accepted a parameter to "force" the encoding and ignore the META tag and HTTP response header.
Or is there a workaround for this problem?
Yes, you could add a 'force' parameter to setEncoding().
Submit it as a patch when you have it done.
OK, I'll do it.
>Yes, you could add a 'force' parameter to setEncoding().
>Submit it as a patch when you have it done.
I've done it. How does this look?
org/htmlparser/Parser.java: a new function
/**
 * Set the encoding, optionally forcing the parser to keep it.
 * When forced, the encoding cannot be overridden by the HTTP
 * header or by META tags.
 * @param encoding The new character set to use.
 * @param forces If <code>true</code>, the parser is forced to use it.
 * @exception ParserException same as {@link #setEncoding(String)}.
 * @see #setEncoding(String)
 */
public void setEncoding (String encoding, boolean forces)
    throws
        ParserException
{
    // Delegate to the page, which remembers whether the encoding is forced.
    getLexer ().getPage ().setEncoding (encoding, forces);
}
org/htmlparser/lexer/Page.java: new functions and a property
/**
 * Set to true when the parser was forced to use a designated encoding.
 */
private boolean mEncodingForced = false;

/**
 * Get whether the parser was forced to use a designated encoding.
 * @return <code>true</code> if the encoding was forced.
 */
public boolean isEncodingForced()
{
    return mEncodingForced;
}

/**
 * Set the encoding, optionally forcing the parser to keep it.
 * When forced, the encoding cannot be overridden by the HTTP
 * header or by META tags.
 * @param character_set same as {@link #setEncoding(String)}.
 * @param forces If <code>true</code>, the parser is forced to use it.
 * @exception ParserException same as {@link #setEncoding(String)}.
 * @see #setEncoding(String)
 */
public void setEncoding (String character_set, boolean forces)
    throws
        ParserException
{
    // Record the flag first so that MetaTag can check it later.
    mEncodingForced = forces;
    getSource ().setEncoding (character_set);
}
org/htmlparser/tags/MetaTag.java: a change
From:
if ("Content-Type".equalsIgnoreCase (httpEquiv) )
{
To:
if ("Content-Type".equalsIgnoreCase (httpEquiv)
&& !getPage ().isEncodingForced())
{
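To show how the three changes above are meant to interact, here is a self-contained model of the flag logic. It is only a sketch: EncodingState and onDeclaredCharset are hypothetical names that stand in for Page and for the MetaTag check, and no htmlparser classes are used.

```java
// Hypothetical stand-in for the patched Page, for illustration only.
final class EncodingState {
    private String encoding;
    private boolean forced;

    EncodingState(String initialEncoding) {
        this.encoding = initialEncoding;
    }

    /** Mirrors the proposed setEncoding(String, boolean). */
    void setEncoding(String encoding, boolean forces) {
        this.forced = forces;
        this.encoding = encoding;
    }

    boolean isEncodingForced() {
        return forced;
    }

    /** Mirrors the MetaTag check: a declared charset is ignored when forced. */
    void onDeclaredCharset(String declared) {
        if (!isEncodingForced()) {
            encoding = declared;
        }
    }

    String getEncoding() {
        return encoding;
    }
}
```

With the flag set, a later charset=Shift_JIS declaration leaves the encoding at windows-31j. Without it, the declaration wins, which is the behaviour reported at the start of this thread.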
Usage of this feature:
Parser parser = new Parser("http://kyushu.yomiuri.co.jp/entame/ramen/04ra/ramen040108.htm");
parser.setEncoding("windows-31j");
parser.visitAllNodesWith(visitor);
ouch, the example is wrong...
parser.setEncoding("windows-31j", true);
                                  ^^^^
>Usage of this feature:
>Parser parser = new Parser("http://kyushu.yomiuri.co.jp/entame/ramen/04ra/ramen040108.htm");
>parser.setEncoding("windows-31j");