Re: [Htmlparser-developer] How parse HTML in spanish?
From: Derrick O. <Der...@ro...> - 2002-12-14 04:27:21
Yes, there seems to be a problem. The openURLConnection() method of HTMLParser uses the "8859_4" encoding, which presumably maps to iso-8859-4. According to RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt?number=2616), the default charset should be iso-8859-1 (section 3.7.1). The page at http://www.elmundo.es does indeed declare content="text/html; charset=iso-8859-1" and should be interpreted that way.

From what I can find, 8859-4 is an extension of 8859-1 for Lithuanian and Latvian characters, and is superseded by 8859-10. For example, see the Linux man pages for charsets (http://nodevice.com/sections/ManIndex/man0132.html):

    8859-4 (Latin-4)
        Latin-4 introduced letters for Estonian, Latvian, and
        Lithuanian. It is essentially obsolete; see 8859-10 (Latin-6).

So I have changed the default to "8859_1".

I would also make sure that you are able to see the correct glyphs by running this:

    public class Test
    {
        public static void main (String[] args)
        {
            // Unicode escapes keep the source ASCII-only; the output should
            // show an accented 'a' in "mas" and an n-tilde in "Espana".
            System.out.println ("El m\u00e1s famoso y caro vino de pago de Espa\u00f1a");
        }
    }

Going forward, it would be good for HTMLParser to honour the charset property on the "Content-Type" field in the HTML header. But at that point the InputStream from the URLConnection is already partially consumed by the parser, and a switch of character set may be problematic. It's not clear when the character set is supposed to take effect (within <BODY> </BODY>?), but this may be a good reason to use mark() and reset() on the Reader.

Derrick

agente007 wrote:

> --- On Fri 12/13, Derrick Oswald < Der...@ro... > wrote:
> From: Derrick Oswald [mailto: Der...@ro...]
> To: htm...@li...
> Date: Fri, 13 Dec 2002 08:22:00 -0500
> Subject: Re: [Htmlparser-developer] How parse HTML in spanish?
>
> Can you be more specific about what isn't being extracted correctly?
> The best way would be to make a test case that shows the problem and
> submit it as a bug.
>
> For example, when I try a URL as : "http://www.elmundo.es" then
> appears the text: ...
> Text = El mßs famoso y caro vino de pago de
> Espa?a, el Pingus, no podrß acceder a una denominaci?n de origen
> propia ... The correct text would be: ... Text = El más famoso y caro
> vino de pago de España, el Pingus, no podrá acceder a una denominación
> de origen propia ... what happened? Juan J
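
On the idea of honouring the declared charset, here is a minimal sketch of how the mark()/reset() approach might look. It is not HTMLParser code: the class CharsetSniffer, its methods and the 4096-byte SNIFF_LIMIT are all hypothetical names and values chosen for illustration. It also marks the underlying byte stream rather than the Reader, on the assumption that switching character sets means decoding the same bytes again. The sniffing step only works for ASCII-compatible encodings, and it falls back to iso-8859-1 when no usable declaration is found.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser.
class CharsetSniffer {
    // Assumed limit on how many leading bytes are scanned for a declaration.
    private static final int SNIFF_LIMIT = 4096;

    // Matches e.g. charset=iso-8859-1 inside a Content-Type value.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Charset named in the text, or ISO-8859-1 (the RFC 2616 default) if none is usable. */
    static Charset charsetFrom(String text) {
        Matcher m = CHARSET.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the default below.
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    /** Peeks at the head of the stream, rewinds, then decodes with the declared charset. */
    static Reader open(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw, SNIFF_LIMIT);
        in.mark(SNIFF_LIMIT);
        byte[] head = new byte[SNIFF_LIMIT];
        int n = 0;
        int r;
        while (n < head.length && (r = in.read(head, n, head.length - n)) > 0) {
            n += r;
        }
        in.reset(); // rewind so the real decode starts at byte 0
        // ISO-8859-1 maps every byte to a char, so this decode cannot fail.
        String prefix = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        return new InputStreamReader(in, charsetFrom(prefix));
    }

    /** Convenience wrapper: decodes a whole byte array into a String. */
    static String decode(byte[] bytes) {
        try (Reader reader = open(new ByteArrayInputStream(bytes))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=utf-8\"></head>"
            + "<body>Espa\u00f1a</body></html>";
        // The bytes are UTF-8 and the page says so, so the text should round-trip.
        System.out.println(decode(page.getBytes(StandardCharsets.UTF_8)).equals(page));
    }
}
```

With a real connection, the stream passed to open() would be the one returned by URLConnection.getInputStream(). A charset given in the HTTP Content-Type response header could be checked first with charsetFrom() before falling back to sniffing the document.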