Re: [Htmlparser-user] Parsing ISO-8859-1 --> storing in UTF-8?

Derrick,

thanks for your advice! I tried setEncoding(), but then I get 
ParserExceptions.

I also tried my code with all kinds of public pages, all using 
ISO-8859-1, but whenever they have characters specific to ISO-8859-1 (as 
I mentioned, for example the lower and upper quotation marks) I have 
problems.

I debugged the code in Eclipse. This is how I retrieve the link text:

<code>

LinkTag linkTag = (LinkTag) linkNode;
String linktext = linkTag.getLinkText();

</code>

The method "linkTag.getLinkText()" returns the text with little 
"boxes" instead of the quotation marks (I can't put them in this 
plain-text mail).

So it seems that the getLinkText() method returns these characters 
wrongly encoded?
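
One possible explanation, offered as an assumption rather than a diagnosis: the curly quotation marks are not part of ISO-8859-1 at all. In windows-1252 they are the bytes 0x84 and 0x93, and a strict ISO-8859-1 decoder turns those bytes into the invisible control characters U+0084 and U+0093, which many viewers draw as boxes. If that is what happens here, the sketch below shows how the text could be repaired after the fact and then encoded as UTF-8. The class and method names (EncodingSketch, repairCp1252, toUtf8) are hypothetical and not part of HTMLParser, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {

    // Hypothetical helper: undo a wrong ISO-8859-1 decode of bytes that
    // were really windows-1252. ISO-8859-1 maps U+0000..U+00FF one-to-one
    // to bytes, so getBytes() recovers the original bytes exactly.
    static String repairCp1252(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName("windows-1252"));
    }

    // A Java String has no encoding of its own; UTF-8 only comes into
    // play when the characters are turned into bytes for storage.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for what getLinkText() might return (assumption):
        // control characters where the quotation marks should be.
        String fromParser = "\u0084Quote\u0093";

        String repaired = repairCp1252(fromParser);
        // true if the result is U+201E Quote U+201C
        System.out.println(repaired.equals("\u201EQuote\u201C"));

        // 3 + 5 + 3 bytes once encoded as UTF-8
        System.out.println(toUtf8(repaired).length);
    }
}
```

This only helps while the string still carries the control characters. If the text already contains literal question marks, the original bytes are gone and the fix has to happen when the page is decoded, for example by passing a different charset name such as "windows-1252" to setEncoding() (untested against this parser version).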

Thanks and regards
Jan Hempel

Derrick Oswald wrote:
> Jan,
> 
> It may be that the site is lying (in the HTTP header or even in the META 
> tag of the page) and it really is in another encoding - maybe UTF-8 already.
> Try setEncoding() on the Parser before asking for nodes or filtering.
> 
> Derrick
> 
> Jan Hempel wrote:
> 
>> Hi guys,
>>
>> I'm trying to parse a website which is encoded in ISO-8859-1. I need to 
>> store extracted link-texts in UTF-8 format.
>>
>> My code looks like this:
>>
>> <code>
>> Parser myParser = new Parser();
>> myParser.setURL(url);
>>
>> // I created a filter named "myLinkFilter" which filters LinkNodes
>> NodeList myLinkNodeList = myParser.parse(myLinkFilter);
>>
>> Node myLinkNode = myLinkNodeList.elementAt(0);
>>
>> LinkTag linkTag = (LinkTag) myLinkNode;
>>
>> String linkText = linkTag.getLinkText();
>> </code>
>>
>> The problem now is that certain characters (like the lower quotation 
>> marks: „Quote“) are converted to question marks.
>>
>> So I tried code like this:
>>
>> <code>
>>
>> String isoString = linkTag.getLinkText();
>> String utf8String = null;
>>
>> try
>> {
>>    byte[] stringBytesISO = isoString.getBytes("ISO-8859-1");
>>    utf8String = new String(stringBytesISO, "UTF-8");
>> }
>> catch (UnsupportedEncodingException e)
>> {
>>    // do something...
>> }
>> </code>
>>
>> But this still returns question marks in the utf8String.
>> Any ideas what I need to change?
>>
>> Thanks and regards
>> Jan Hempel
>>
>>
>>
>>
>>
>> -------------------------------------------------------------------------
>> Using Tomcat but need to do more? Need to support web services, security?
>> Get stuff done quickly with pre-integrated technology to make your job easier
>> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
>> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
>> _______________________________________________
>> Htmlparser-user mailing list
>> Htm...@li...
>> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>>
>>  
>>
> 
> 