Re: [Htmlparser-user] Parsing ISO-8859-1 --> storing in UTF-8?
From: Jan H. <jan...@gm...> - 2006-10-17 20:06:09
Derrick,
thanks for your advice! I tried setEncoding(), but then I get
ParserExceptions.
I also tried my code with all kinds of public pages, all using
ISO-8859-1, but whenever they have characters specific to ISO-8859-1 (as
I mentioned, for example the lower and upper quotation marks) I have
problems.
I debugged the code in Eclipse. This is how I retrieve the link text:
<code>
LinkTag linkTag = (LinkTag) linkNode;
String linktext = linkTag.getLinkText();
</code>
The method "linkTag.getLinkText()" returns the text with little
"boxes" instead of the quotation marks (I can't put them in this
plain-text mail).
So it seems like the getLinkText() method does return these characters
wrongly encoded?
Thanks and regards
Jan Hempel
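
One possible explanation for the boxes, offered as an assumption rather than something confirmed in this thread: the low and high quotation marks are not part of ISO-8859-1 at all. Pages that declare ISO-8859-1 often actually contain windows-1252 bytes, where 0x84 and 0x93 are those quotation marks. Decoded as ISO-8859-1, the same bytes become invisible control characters, which many fonts draw as boxes. The class and helper names below are hypothetical, and the sketch assumes the JDK ships the windows-1252 charset (common JDKs do).

```java
import java.nio.charset.Charset;

class EncodingMismatchDemo {
    // Decode raw bytes using the charset with the given name.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0x84 and 0x93 are the low and high quotation marks in windows-1252,
        // but C1 control characters in ISO-8859-1.
        byte[] raw = { (byte) 0x84, 'Q', 'u', 'o', 't', 'e', (byte) 0x93 };

        String asLatin1 = decode(raw, "ISO-8859-1");
        String asCp1252 = decode(raw, "windows-1252");

        // Prints U+0084 (a control character) for ISO-8859-1
        System.out.printf("ISO-8859-1   first char: U+%04X%n", (int) asLatin1.charAt(0));
        // Prints U+201E (the low quotation mark) for windows-1252
        System.out.printf("windows-1252 first char: U+%04X%n", (int) asCp1252.charAt(0));
    }
}
```

If that is what is happening, "windows-1252" would be the value to try in the setEncoding() call Derrick suggested; this thread does not confirm that it resolves the problem.
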
Derrick Oswald wrote:
> Jan,
>
> It may be that the site is lying (in the HTTP header or even in the META
> tag of the page) and it really is in another encoding - maybe UTF-8 already.
> Try setEncoding() on the Parser before asking for nodes or filtering.
>
> Derrick
>
> Jan Hempel wrote:
>
>> Hi guys,
>>
>> I'm trying to parse a website which is encoded in ISO-8859-1. I need to
>> store extracted link-texts in UTF-8 format.
>>
>> My code looks like this:
>>
>> <code>
>> Parser myParser = new Parser();
>> myParser.setURL(url);
>>
>> // I created a filter named "myLinkFilter" which filters LinkNodes
>> NodeList myLinkNodeList = myParser.parse(myLinkFilter);
>>
>> Node myLinkNode = myLinkNodeList.elementAt(0);
>>
>> LinkTag linkTag = (LinkTag) myLinkNode;
>>
>> String linkText = linkTag.getLinkText();
>> </code>
>>
>> The problem now is that certain characters (like the lower quotation
>> marks: „Quote“) are converted to question marks.
>>
>> So I tried code like this:
>>
>> <code>
>>
>> String isoString = linkTag.getLinkText();
>> String utf8String = null;
>>
>> try
>> {
>> byte[] stringBytesISO = isoString.getBytes("ISO-8859-1");
>> utf8String = new String(stringBytesISO, "UTF-8");
>> }
>> catch (UnsupportedEncodingException e)
>> {
>> // do something...
>> }
>> </code>
>>
>> But this still returns question marks in the utf8String.
>> Any ideas what I need to change?
>>
>> Thanks and regards
>> Jan Hempel
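
A note on why the conversion attempt in the original message cannot work, as a minimal sketch with hypothetical class and method names: a Java String carries no byte encoding, so there is nothing to convert inside the String itself. Encoding only matters at the moment bytes are produced. getBytes with ISO-8859-1 replaces every character that charset cannot represent with '?', and decoding those bytes as UTF-8 keeps the '?'. The sketch uses StandardCharsets, which needs Java 7 or later and so postdates this thread; the string literal stands in for the value returned by getLinkText().

```java
import java.nio.charset.StandardCharsets;

class Utf8StorageDemo {
    // The round trip from the original message: lossy for any character
    // that ISO-8859-1 cannot represent.
    static String lossyRoundTrip(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    // Encode only when writing: these bytes are what gets stored as UTF-8.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for linkTag.getLinkText(), assuming it was decoded correctly.
        String linkText = "\u201EQuote\u201C";

        System.out.println(lossyRoundTrip(linkText)); // prints ?Quote?

        byte[] stored = toUtf8(linkText);
        System.out.println(stored.length); // prints 11 (3 + 5 + 3 bytes)

        // Decoding the stored bytes as UTF-8 restores the original text.
        String restored = new String(stored, StandardCharsets.UTF_8);
        System.out.println(restored.equals(linkText)); // prints true
    }
}
```

This only helps once the parser has decoded the page correctly; if the String already contains wrong characters, no re-encoding afterwards will repair it.
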
>>
>>
>>
>>
>>
>> -------------------------------------------------------------------------
>> Using Tomcat but need to do more? Need to support web services, security?
>> Get stuff done quickly with pre-integrated technology to make your job easier
>> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
>> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
>> _______________________________________________
>> Htmlparser-user mailing list
>> Htm...@li...
>> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>>
>>
>>