Re: [Htmlparser-user] Parsing ISO-8859-1 --> storing in UTF-8?
From: Derrick O. <Der...@Ro...> - 2006-10-16 12:13:46
Jan,
The site may be declaring the wrong encoding (in the HTTP header or even in
the META tag of the page) while really serving another one, maybe UTF-8 already.
Try calling setEncoding() on the Parser before asking for nodes or filtering.
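
To make the encoding issue concrete, here is a minimal sketch using only the JDK, not the HTMLParser API. The class and method names (EncodingDemo, throughLatin1, toUtf8, repairMislabeledUtf8) are made up for illustration, and it assumes Java 7 or later for StandardCharsets. It shows that ISO-8859-1 has no code points for „ and “, so encoding them to that charset yields '?', and that text wrongly decoded from UTF-8 bytes as ISO-8859-1 can be recovered by reversing the wrong step.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Encode to ISO-8859-1 and decode again. Characters that ISO-8859-1
    // cannot represent are replaced by '?' during getBytes().
    public static String throughLatin1(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    // A Java String is already Unicode; producing UTF-8 for storage
    // only needs an encode step, not a re-decode.
    public static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Repairs text that came from UTF-8 bytes but was decoded as ISO-8859-1:
    // get the original bytes back, then decode them with the right charset.
    public static String repairMislabeledUtf8(String wronglyDecoded) {
        byte[] raw = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String quote = "\u201EQuote\u201C"; // the low/high quotation marks around "Quote"

        // The quotation marks do not exist in ISO-8859-1.
        System.out.println(throughLatin1(quote)); // prints "?Quote?"

        // Simulate a page that is really UTF-8 but labeled ISO-8859-1.
        String wrong = new String(toUtf8(quote), StandardCharsets.ISO_8859_1);
        System.out.println(repairMislabeledUtf8(wrong).equals(quote)); // prints "true"
    }
}
```

If the link text already contains question marks by the time it reaches your code, the characters were lost earlier and no later conversion will bring them back, which is why the encoding has to be right on the Parser itself.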
Derrick
Jan Hempel wrote:
>Hi guys,
>
>I'm trying to parse a website which is encoded in ISO-8859-1. I need to
>store extracted link-texts in UTF-8 format.
>
>My code looks like this:
>
><code>
>Parser myParser = new Parser();
>myParser.setURL(url);
>
>// I created a filter named "myLinkFilter" which filters LinkNodes
>NodeList myLinkNodeList = myParser.parse(myLinkFilter);
>
>Node myLinkNode = myLinkNodeList.elementAt(0);
>
>LinkTag linkTag = (LinkTag) myLinkNode;
>
>String linkText = linkTag.getLinkText();
></code>
>
>The problem now is that certain characters (like the low quotation
>marks: „Quote“) are converted to question marks.
>
>So I tried code like this:
>
><code>
>
>String isoString = linkTag.getLinkText();
>String utf8String = null;
>
>try
>{
> byte[] stringBytesISO = isoString.getBytes("ISO-8859-1");
> utf8String = new String(stringBytesISO, "UTF-8");
>}
>catch (UnsupportedEncodingException e)
>{
> // do something...
>}
></code>
>
>But this still returns question marks in the utf8String.
>Any ideas what I need to change?
>
>Thanks and regards
>Jan Hempel