Re: [Htmlparser-user] Parsing ISO-8859-1 --> storing in UTF-8?
From: Jan H. <jan...@gm...> - 2006-10-17 20:06:09
Derrick,

thanks for your advice! I tried setEncoding(), but then I get ParserExceptions.

I also tried my code with all kinds of public pages, all using ISO-8859-1, but whenever they have characters specific to ISO-8859-1 (as I mentioned, for example the lower and upper quotation marks) I have problems.

I debugged the code in Eclipse. This is how I retrieve the link text:

<code>
LinkTag linkTag = (LinkTag) linkNode;
String linktext = linkTag.getLinkText();
</code>

The method linkTag.getLinkText() returns the text with little "boxes" instead of the quotation marks (I can't put them in this plain-text mail). So it seems that getLinkText() returns these characters wrongly encoded?

Thanks and regards
Jan Hempel

Derrick Oswald wrote:
> Jan,
>
> It may be that the site is lying (in the HTTP header or even in the META
> tag of the page) and it really is in another encoding - maybe UTF-8 already.
> Try setEncoding() on the Parser before asking for nodes or filtering.
>
> Derrick
>
> Jan Hempel wrote:
>
>> Hi guys,
>>
>> I'm trying to parse a website which is encoded in ISO-8859-1. I need to
>> store extracted link-texts in UTF-8 format.
>>
>> My code looks like this:
>>
>> <code>
>> Parser myParser = new Parser();
>> myParser.setURL(url);
>>
>> // I created a filter named "myLinkFilter" which filters LinkNodes
>> NodeList myLinkNodeList = myParser.parse(myLinkFilter);
>>
>> Node myLinkNode = myLinkNodeList.elementAt(0);
>>
>> LinkTag linkTag = (LinkTag) myLinkNode;
>>
>> String linkText = linkTag.getLinkText();
>> </code>
>>
>> The problem now is, that certain characters (like the lower quotation
>> marks: „Quote“) are converted to question marks.
>>
>> So I tried a coding like this:
>>
>> <code>
>>
>> String isoString = linkTag.getLinkText();
>> String utf8String = null;
>>
>> try
>> {
>>     byte[] stringBytesISO = isoString.getBytes("ISO-8859-1");
>>     utf8String = new String(stringBytesISO, "UTF-8");
>> }
>> catch (UnsupportedEncodingException e)
>> {
>>     // do something...
>> }
>> </code>
>>
>> But this still returns question marks in the utf8String.
>> Any ideas what I need to change?
>>
>> Thanks and regards
>> Jan Hempel
>>
>> -------------------------------------------------------------------------
>> Using Tomcat but need to do more? Need to support web services, security?
>> Get stuff done quickly with pre-integrated technology to make your job easier
>> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
>> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
>> _______________________________________________
>> Htmlparser-user mailing list
>> Htm...@li...
>> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
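
A possible explanation, offered as a sketch and not confirmed anywhere in this thread: the characters „ (U+201E) and “ (U+201C) have no code points in ISO-8859-1. In windows-1252 they are the bytes 0x84 and 0x93, which ISO-8859-1 treats as invisible control characters, and that would match the "boxes". The example below assumes the pages are really windows-1252 but labelled ISO-8859-1. The class EncodingDemo and its helpers redecode() and toUtf8() are made up for illustration and are not part of htmlparser. It also shows why the getBytes("ISO-8859-1") / new String(..., "UTF-8") round trip cannot help: a Java String is already Unicode, so that step only damages it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Assumption: the page bytes are windows-1252, not ISO-8859-1 as declared.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Hypothetical helper: undo a wrong ISO-8859-1 decode. ISO-8859-1 maps
    // every byte to the char with the same value, so getBytes() restores the
    // original page bytes, which are then decoded as windows-1252.
    static String redecode(String decodedAsLatin1) {
        byte[] raw = decodedAsLatin1.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, CP1252);
    }

    // Hypothetical helper: UTF-8 only matters when the String becomes bytes
    // again, for example when writing it to a file or a database.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What getLinkText() would return under the assumption above:
        // control characters U+0084 and U+0093 around the word.
        String boxes = "\u0084Quote\u0093";

        // The round trip from the original mail: 0x84 and 0x93 are not
        // valid UTF-8 on their own, so they become replacement characters.
        String broken = new String(
                boxes.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(broken.charAt(0)));

        // Re-decoding as windows-1252 yields the real quotation marks.
        String fixed = redecode(boxes);
        System.out.println(Integer.toHexString(fixed.charAt(0)));

        // Encode as UTF-8 only at the point of storage.
        System.out.println(toUtf8(fixed).length + " bytes in UTF-8");
    }
}
```

If that assumption holds, passing "windows-1252" to the setEncoding() call Derrick suggests might avoid the problem at the source, so that redecode() is never needed. Whether that also triggers the ParserExceptions mentioned above is not something this sketch can tell.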