[Htmlparser-user] Parsing ISO-8859-1 --> storing in UTF-8?
From: Jan H. <jan...@gm...> - 2006-10-15 20:47:02
Hi guys,
I'm trying to parse a website that is encoded in ISO-8859-1, and I need to
store the extracted link texts in UTF-8.
My code looks like this:
<code>
Parser myParser = new Parser();
myParser.setURL(url);
// I created a filter named "myLinkFilter" which filters LinkNodes
NodeList myLinkNodeList = myParser.parse(myLinkFilter);
Node myLinkNode = myLinkNodeList.elementAt(0);
LinkTag linkTag = (LinkTag) myLinkNode;
String linkText = linkTag.getLinkText();
</code>
The problem is that certain characters (like the low quotation
marks: „Quote“) are converted to question marks.
So I tried code like this:
<code>
String isoString = linkTag.getLinkText();
String utf8String = null;
try
{
byte[] stringBytesISO = isoString.getBytes("ISO-8859-1");
utf8String = new String(stringBytesISO, "UTF-8");
}
catch (UnsupportedEncodingException e)
{
// do something...
}
</code>
But this still leaves question marks in utf8String.
Any ideas what I need to change?
Thanks and regards
Jan Hempel
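
A note on the second attempt, offered as a sketch rather than a confirmed diagnosis: a Java String holds Unicode characters and has no encoding of its own, so re-encoding an already decoded String cannot repair it. Encodings only matter where bytes become characters (reading the page) and where characters become bytes (storing the text). The example below assumes the page is really windows-1252, since „ and “ do not exist in ISO-8859-1 at all; that assumption, the class name EncodingSketch, and its helper methods are all hypothetical and not part of HTMLParser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {

    // Decode raw page bytes with the charset the page actually uses.
    static String decode(byte[] raw, Charset pageCharset) {
        return new String(raw, pageCharset);
    }

    // Encode a String as UTF-8 bytes, e.g. just before writing to a file or database.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // The round trip from the question: characters that ISO-8859-1 cannot
    // represent are replaced by '?' in getBytes, so the information is lost
    // before the UTF-8 decoding step even runs.
    static String roundTrip(String text) {
        byte[] isoBytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(isoBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: in windows-1252, 0x84 is the low quote and 0x93 the closing quote.
        byte[] raw = {(byte) 0x84, 'Q', 'u', 'o', 't', 'e', (byte) 0x93};
        String text = decode(raw, Charset.forName("windows-1252"));

        // Print the UTF-8 bytes as hex to avoid depending on the console encoding.
        StringBuilder hex = new StringBuilder();
        for (byte b : toUtf8(text)) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(hex.toString().trim());

        // A correctly decoded String still breaks in the ISO-8859-1 round trip.
        System.out.println(roundTrip("\u201EQuote\u201C")); // prints ?Quote?
    }
}
```

If this assumption holds, the fix belongs at the decoding step: the parser has to read the page with the right charset, and UTF-8 is chosen only when the text is written out (for example through an OutputStreamWriter created with "UTF-8").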