Re: [Htmlparser-developer] How parse HTML in spanish?


--- On Fri 12/13, Derrick Oswald wrote:

From: Derrick Oswald [mailto: Der...@ro...]
To: htm...@li...
Date: Fri, 13 Dec 2002 23:34:18 -0500
Subject: Re: [Htmlparser-developer] How parse HTML in spanish?

Yes, there seems to be a problem.

The openURLConnection() method of HTMLParser uses "8859_4" encoding, which presumably maps to iso-8859-4. According to RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt?number=2616) the default charset should be iso-8859-1 (section 3.7.1).

The content of http://www.elmundo.es is indeed content="text/html; charset=iso-8859-1" and should be interpreted that way. From what I can find, 8859-4 is an extension of 8859-1 for Lithuanian and Latvian characters, and is superseded by 8859-10. For example, see the Linux man pages for charsets (http://nodevice.com/sections/ManIndex/man0132.html):

    8859-4 (Latin-4)
        Latin-4 introduced letters for Estonian, Latvian, and Lithuanian. It is essentially obsolete; see 8859-10 (Latin-6).

So I have changed the default to "8859_1".

I would also make sure that you are able to see the correct glyphs by running this:

    public class Test
    {
        public static void main (String[] args)
        {
            System.out.println ("El m\u00e1s famoso y caro vino de pago de Espa\u00f1a");
        }
    }

Going forward, it would be good for HTMLParser to honour the charset property on the "Content-Type" field in the HTML header. But at that point the InputStream from the URLConnection is already partially consumed by the parser, and a switch of character set may be problematic. It's not clear when the character set is supposed to take effect (within   ?) but this may be a good reason to use mark() and reset() on the Reader.

Derrick

agente007 wrote:

> --- On Fri 12/13, Derrick Oswald wrote:
> From: Derrick Oswald [mailto: Der...@ro...]
> To: htm...@li...
> Date: Fri, 13 Dec 2002 08:22:00 -0500
> Subject: Re: [Htmlparser-developer] How parse HTML in spanish?
>
> Can you be more specific about what isn't being extracted correctly?
> The best way would be to make a test case that shows the problem and
> submit it as a bug.
>
> For example, when I try a URL such as "http://www.elmundo.es", the
> following text appears:
>
> ... Text = El mßs famoso y caro vino de pago de Espa?a, el Pingus, no
> podrß acceder a una denominaci?n de origen propia ...
>
> The correct text would be:
>
> ... Text = El más famoso y caro vino de pago de España, el Pingus, no
> podrá acceder a una denominación de origen propia ...
>
> What happened?
>
> Juan J
> ------------------------------------------------------------------------
> Join Excite! - http://www.excite.com
> The most personalized portal on the Web!
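
To illustrate the mark()/reset() idea raised above, here is a minimal sketch, not HTMLParser's actual code. It assumes the charset declaration appears within the first 4096 bytes and that those bytes are ASCII-compatible. It also marks the underlying byte stream (a BufferedInputStream) rather than the Reader, because the charset has to be known before a Reader is constructed. The class name CharsetSniffer and its methods are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser.
public class CharsetSniffer {
    // Assumption: the charset declaration sits within the first 4096 bytes.
    private static final int SNIFF_LIMIT = 4096;
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or ISO-8859-1 (the RFC 2616 default) if none is usable.
    public static Charset detect(String head) {
        Matcher m = CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall back to the default below.
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Peeks at the start of the stream, rewinds it, then decodes from the beginning.
    public static Reader open(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(SNIFF_LIMIT);
        byte[] head = new byte[SNIFF_LIMIT];
        int total = 0;
        int n;
        while (total < SNIFF_LIMIT && (n = in.read(head, total, SNIFF_LIMIT - total)) > 0) {
            total += n;
        }
        in.reset(); // nothing has been consumed from the caller's point of view
        // ISO-8859-1 maps every byte to a char, so this provisional decode cannot fail.
        String provisional = new String(head, 0, total, StandardCharsets.ISO_8859_1);
        return new InputStreamReader(in, detect(provisional));
    }

    // Convenience wrapper for in-memory pages.
    public static String decode(byte[] page) {
        try (Reader reader = open(new ByteArrayInputStream(page))) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=iso-8859-1\"></head>"
            + "<body>Espa\u00f1a</body></html>";
        System.out.println(decode(html.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

Because the stream is rewound before any Reader exists, the parser never sees text decoded with the wrong charset, which sidesteps the question of when a mid-stream switch should take effect.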

It does not work!

Both the parser and the test program print "El mßs famoso y caro vino de pago de Espa?a".

It should be "El más famoso y caro vino de pago de España".

Note the characters á and ñ (\u00e1 and \u00f1).

Regards

Juan J. Samper
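
Since the test program contains no parsing at all, its wrong output points to the console rather than the parser. One possible explanation, an assumption not confirmed in the thread, is that Java writes á as the single byte 0xE1 and the console shows that byte using a DOS code page such as Cp850, where 0xE1 is ß. The sketch below shows the bytes Java produces for a given encoding and lets the console encoding be passed explicitly. The class name ConsoleEncodingTest and the encode helper are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical diagnostic, not part of HTMLParser.
public class ConsoleEncodingTest {
    static final String SAMPLE = "El m\u00e1s famoso y caro vino de pago de Espa\u00f1a";

    // Returns the bytes a PrintStream would emit for the text in the given encoding.
    public static byte[] encode(String text, String charsetName) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(bytes, true, charsetName);
            out.print(text);
            out.flush();
            return bytes.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + charsetName, e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Show how the single character \u00e1 is encoded in two common charsets.
        for (String name : new String[] {"ISO-8859-1", "UTF-8"}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encode("\u00e1", name)) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(name + ": " + hex.toString().trim());
        }

        // Pass the console's encoding as the first argument, for example IBM850
        // (an assumed value; check what your console actually uses).
        String consoleCharset = args.length > 0 ? args[0] : "ISO-8859-1";
        PrintStream out = new PrintStream(System.out, true, consoleCharset);
        out.println(SAMPLE);
    }
}
```

If the sentence prints correctly once the console's own encoding is passed as the argument, the garbling comes from how the output is displayed and not from how the text was decoded.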
