Re: [Htmlparser-developer] How parse HTML in spanish?
From: agente007 <e-a...@ex...> - 2002-12-16 12:16:22
--- On Fri 12/13, Derrick Oswald wrote:
From: Derrick Oswald [mailto: Der...@ro...]
To: htm...@li...
Date: Fri, 13 Dec 2002 23:34:18 -0500
Subject: Re: [Htmlparser-developer] How parse HTML in spanish?

Yes, there seems to be a problem.

The openURLConnection() method of HTMLParser uses "8859_4" encoding, which presumably maps to iso-8859-4. According to RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt?number=2616) the default charset should be iso-8859-1 (section 3.7.1).

The content of http://www.elmundo.es is indeed content="text/html; charset=iso-8859-1" and should be interpreted that way. From what I can find, 8859-4 is an extension of 8859-1 for Lithuanian and Latvian characters, and is superseded by 8859-10. E.g. see the Linux man pages for charsets (http://nodevice.com/sections/ManIndex/man0132.html):

    8859-4 (Latin-4)
        Latin-4 introduced letters for Estonian, Latvian, and Lithuanian.
        It is essentially obsolete; see 8859-10 (Latin-6).

So I have changed the default to "8859_1".

I would also make sure that you are able to see the correct glyphs by running this:

    public class Test
    {
        // Prints a Spanish sentence written with Unicode escapes, to check the console output.
        public static void main (String[] args)
        {
            System.out.println ("El m\u00e1s famoso y caro vino de pago de Espa\u00f1a");
        }
    }

Going forward, it would be good for HTMLParser to honour the charset property on the "Content-Type" field in the HTML header. But at that point the InputStream from the URLConnection is already partially consumed by the parser, and a switch of character set may be problematic. It's not clear when the character set is supposed to take effect (within ?) but this may be a good reason to use mark() and reset() on the Reader.

Derrick

agente007 wrote:

> --- On Fri 12/13, Derrick Oswald wrote:
> From: Derrick Oswald [mailto: Der...@ro...]
> To: htm...@li...
> Date: Fri, 13 Dec 2002 08:22:00 -0500
> Subject: Re: [Htmlparser-developer] How parse HTML in spanish?
>
> Can you be more specific about what isn't being extracted correctly?
> The best way would be to make a test case that shows the problem and
> submit it as a bug.
>
> For example, when I try a URL such as "http://www.elmundo.es", the
> following text appears:
>
>   ... Text = El mßs famoso y caro vino de pago de Espa?a, el Pingus,
>   no podrß acceder a una denominaci?n de origen propia ...
>
> The correct text would be:
>
>   ... Text = El más famoso y caro vino de pago de España, el Pingus,
>   no podrá acceder a una denominación de origen propia ...
>
> What happened?
>
> Juan J

It does not work! Both the parser and the example print "El mßs famoso y caro vino de pago de Espa?a". It should be "El más famoso y caro vino de pago de España". Note the characters á and ñ (\u00e1 and \u00f1).

Regards,
Juan J. Samper
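
For what it's worth, the mark()/reset() idea from the quoted message could look roughly like the sketch below. This is not HTMLParser code: the class name CharsetSniffer, the methods find(), fromContentType(), open() and decode(), and the 4096-byte SNIFF_LIMIT are all made up for illustration. It also marks the underlying byte stream rather than the Reader, on the assumption that the charset has to be chosen before any bytes are decoded. The fallback is iso-8859-1, as in RFC 2616 section 3.7.1.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser.
final class CharsetSniffer {
    // How many leading bytes to inspect for a charset declaration (arbitrary choice).
    private static final int SNIFF_LIMIT = 4096;
    private static final Charset DEFAULT = StandardCharsets.ISO_8859_1;
    // Naive pattern: matches the first "charset=..." anywhere in the text.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first charset named in the text, or null if none is usable.
    static Charset find(String text) {
        Matcher m = CHARSET.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: treat as "not declared".
            }
        }
        return null;
    }

    // Charset from a Content-Type value, falling back to iso-8859-1.
    static Charset fromContentType(String contentType) {
        Charset cs = (contentType == null) ? null : find(contentType);
        return (cs != null) ? cs : DEFAULT;
    }

    // Peeks at the start of the stream, rewinds, then decodes from the beginning.
    static Reader open(InputStream raw, String headerContentType) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw, SNIFF_LIMIT);
        in.mark(SNIFF_LIMIT);
        byte[] head = new byte[SNIFF_LIMIT];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        in.reset(); // back to the first byte; nothing has been decoded yet

        // The HTTP header wins; otherwise look for a declaration in the page itself.
        Charset cs = (headerContentType == null) ? null : find(headerContentType);
        if (cs == null) {
            // iso-8859-1 maps every byte to one char, so the ASCII markup survives intact.
            cs = find(new String(head, 0, n, StandardCharsets.ISO_8859_1));
        }
        return new InputStreamReader(in, (cs != null) ? cs : DEFAULT);
    }

    // Convenience wrapper for in-memory pages.
    static String decode(byte[] page, String headerContentType) {
        try (Reader r = open(new ByteArrayInputStream(page), headerContentType)) {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] page = ("<meta http-equiv=\"Content-Type\" content=\"text/html; "
                + "charset=utf-8\">Espa\u00f1a").getBytes(StandardCharsets.UTF_8);
        String text = decode(page, null);
        // Print the code point instead of the glyph, so the console encoding cannot hide the result.
        System.out.println(Integer.toHexString(text.charAt(text.length() - 2))); // prints "f1"
    }
}
```

Printing the code point instead of the glyph is deliberate: if the Test program above also shows wrong characters, the console's own encoding may be involved, and the hex value separates what was decoded from what the terminal can display.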