Thread: [Htmlparser-user] Parsing ISO-8859-1 --> storing in UTF-8?
From: Jan H. <jan...@gm...> - 2006-10-15 20:47:02
Hi guys,

I'm trying to parse a website which is encoded in ISO-8859-1. I need to store the extracted link texts in UTF-8 format.

My code looks like this:

<code>
Parser myParser = new Parser();
myParser.setURL(url);

// I created a filter named "myLinkFilter" which filters LinkNodes
NodeList myLinkNodeList = myParser.parse(myLinkFilter);

Node myLinkNode = myLinkNodeList.elementAt(0);

LinkTag linkTag = (LinkTag) myLinkNode;

String linkText = linkTag.getLinkText();
</code>

The problem is that certain characters (like the lower quotation marks: „Quote“) are converted to question marks.

So I tried code like this:

<code>
String isoString = linkTag.getLinkText();
String utf8String = null;

try
{
    byte[] stringBytesISO = isoString.getBytes("ISO-8859-1");
    utf8String = new String(stringBytesISO, "UTF-8");
}
catch (UnsupportedEncodingException e)
{
    // do something...
}
</code>

But this still returns question marks in utf8String. Any ideas what I need to change?

Thanks and regards
Jan Hempel
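A note on the second snippet in the question: a Java String is already Unicode, so it has no "ISO" or "UTF-8" state to convert between. Calling getBytes("ISO-8859-1") replaces every character that ISO-8859-1 cannot represent with '?', and „ (U+201E) and “ (U+201C) are not part of ISO-8859-1, so the round trip itself produces question marks. The sketch below uses only the JDK; the class and method names are made up for illustration and are not part of HTMLParser.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of HTMLParser.
final class LinkTextEncoding {

    /** Encodes text as UTF-8 bytes, e.g. just before writing it to a file or database. */
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** The round trip from the question: lossy for any character outside ISO-8859-1. */
    static String roundTrip(String text) {
        // Unmappable characters become the replacement byte '?' (0x3F) here.
        byte[] isoBytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(isoBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String linkText = "\u201EQuote\u201C"; // the quoted word from the question

        System.out.println(roundTrip(linkText).equals("?Quote?")); // prints "true"
        System.out.println(toUtf8(linkText).length);               // prints "11"
    }
}
```

If the String returned by getLinkText() already holds the right characters, encoding it once with UTF-8 at the point of storage is all that is needed. If it does not, the damage happened earlier, while the page bytes were decoded, and no later conversion can repair it.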
From: Derrick O. <Der...@Ro...> - 2006-10-16 12:13:46
Jan,

It may be that the site is lying (in the HTTP header or even in the META tag of the page) and it really is in another encoding - maybe UTF-8 already.
Try setEncoding() on the Parser before asking for nodes or filtering.

Derrick

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
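One way to test the "maybe UTF-8 already" suspicion before overriding the encoding is to check whether the raw page bytes form well-formed UTF-8. The following is a rough sketch using only the JDK, not anything HTMLParser provides; the class name, method names and the fallback policy are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of HTMLParser.
final class CharsetSniffer {

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Assumed policy: prefer UTF-8 when the bytes allow it, else trust the declared charset. */
    static String chooseEncoding(byte[] raw, String declared) {
        return isValidUtf8(raw) ? "UTF-8" : declared;
    }

    public static void main(String[] args) {
        byte[] utf8Quote = {(byte) 0xE2, (byte) 0x80, (byte) 0x9E}; // U+201E encoded as UTF-8
        byte[] singleByteQuote = {(byte) 0x84};                    // a lone continuation byte

        System.out.println(chooseEncoding(utf8Quote, "ISO-8859-1"));       // prints "UTF-8"
        System.out.println(chooseEncoding(singleByteQuote, "ISO-8859-1")); // prints "ISO-8859-1"
    }
}
```

The chosen name could then be handed to setEncoding(). Pure ASCII input also counts as valid UTF-8 here, which is harmless because ASCII bytes decode identically in both encodings.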
From: Jan H. <jan...@gm...> - 2006-10-17 20:06:09
Derrick,

thanks for your advice! I tried setEncoding(), but then I get ParserExceptions.

I also tried my code with all kinds of public pages, all using ISO-8859-1, but whenever they have characters specific to ISO-8859-1 (as I mentioned, for example the lower and upper quotation marks) I have problems.

I debugged the code in Eclipse. This is how I retrieve the link text:

<code>
LinkTag linkTag = (LinkTag) linkNode;
String linktext = linkTag.getLinkText();
</code>

The method linkTag.getLinkText() returns the text with little "boxes" instead of the quotation marks (I can't put them in this plain-text mail). So it seems the getLinkText() method returns these characters wrongly encoded?

Thanks and regards
Jan Hempel
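A possible explanation for the "boxes", which the thread itself does not confirm: the characters „ and “ do not exist in ISO-8859-1 at all. Pages that declare ISO-8859-1 but contain them are often really encoded in windows-1252, where they are the single bytes 0x84 and 0x93. Decoded as ISO-8859-1, those bytes become the invisible control characters U+0084 and U+0093, which many fonts draw as boxes. The sketch below shows the difference with the JDK alone; the class name is invented, and it assumes the running JDK ships the windows-1252 charset (common, but not guaranteed by the Java specification).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HTMLParser.
final class QuoteBytes {

    /** Decodes raw page bytes with the given charset. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The quoted word from the question as windows-1252 bytes.
        byte[] raw = {(byte) 0x84, 'Q', 'u', 'o', 't', 'e', (byte) 0x93};

        // Assumption: this charset is available in the running JDK.
        Charset cp1252 = Charset.forName("windows-1252");

        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);
        String asCp1252 = decode(raw, cp1252);

        System.out.printf("U+%04X%n", (int) asLatin1.charAt(0)); // prints "U+0084"
        System.out.printf("U+%04X%n", (int) asCp1252.charAt(0)); // prints "U+201E"
    }
}
```

If this is what is happening, passing "windows-1252" rather than "ISO-8859-1" to setEncoding() would be the thing to try. Whether that also avoids the ParserExceptions mentioned above cannot be told from the thread.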