Re: [Htmlparser-developer] How parse HTML in spanish?
From: Derrick O. <Der...@ro...> - 2002-12-16 12:51:06
Juan,

I don't see that particular string anymore at http://www.elmundo.es, but another instance is:

    El miércoles tendrá lugar el estreno mundial de la segunda entrega de 'El señor de los anillos'.

which has both the \u00e1 and \u00f1 characters printing correctly.

I also tried the jar file directly from the release candidate /htmlparser/htmlparser1_2_20021215.zip and it also prints those characters correctly. You may be using an old jar file.

Derrick

agente007 wrote:

> --- On Fri 12/13, Derrick Oswald < Der...@ro... > wrote:
> From: Derrick Oswald [mailto: Der...@ro...]
> To: htm...@li...
> Date: Fri, 13 Dec 2002 23:34:18 -0500
> Subject: Re: [Htmlparser-developer] How parse HTML in spanish?
>
> Yes, there seems to be a problem.
>
> The openURLConnection() method of HTMLParser uses "8859_4" encoding
> which presumably maps to iso-8859-4.
>
> According to RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt?number=2616)
> the default charset should be iso-8859-1 (section 3.7.1).
> The content of http://www.elmundo.es is indeed content="text/html;
> charset=iso-8859-1" and should be interpreted that way.
>
> From what I can find, the 8859-4 is an extension of 8859-1 for
> Lithuanian and Latvian characters, and is superseded by 8859-10.
> e.g. see the Linux man pages for charsets
> (http://nodevice.com/sections/ManIndex/man0132.html)
>     8859-4 (Latin-4)
>         Latin-4 introduced letters for Estonian, Latvian, and
>         Lithuanian. It is essentially obsolete; see 8859-10 (Latin-6).
>
> So I have changed the default to "8859_1".
>
> I would also make sure that you are able to see the correct glyphs by
> running this:
>
> public class Test
> {
>     public static void main (String[] args)
>     {
>         System.out.println ("El m\u00e1s famoso y caro vino de pago de Espa\u00f1a");
>     }
> }
>
> Going forward, it would be good for HTMLParser to honour the charset
> property on the "Content-Type" field in the HTML header. But at that
> point the InputStream from the URLConnection is already partially
> consumed by the parser and a switch of character set may be problematic.
> It's not clear when the character set is supposed to take effect (within
> ?) but this may be a good reason to use the mark() and
> reset() on the Reader.
>
> Derrick
>
> agente007 wrote:
>
> > --- On Fri 12/13, Derrick Oswald < Der...@ro... > wrote:
> > From: Derrick Oswald [mailto: Der...@ro...]
> > To: htm...@li...
> > Date: Fri, 13 Dec 2002 08:22:00 -0500
> > Subject: Re: [Htmlparser-developer] How parse HTML in spanish?
> >
> > Can you be more specific about what isn't being extracted correctly?
> > The best way would be to make a test case that shows the problem and
> > submit it as a bug.
> >
> > For example, when I try a URL such as "http://www.elmundo.es", this
> > text appears:
> > ... Text = El mßs famoso y caro vino de pago de Espa?a, el Pingus,
> > no podrß acceder a una denominaci?n de origen propia ...
> > The correct text would be:
> > ... Text = El más famoso y caro vino de pago de España, el Pingus,
> > no podrá acceder a una denominación de origen propia ...
> > What happened?
> > Juan J
> > ------------------------------------------------------------------------
> > Join Excite! - http://www.excite.com
> > The most personalized portal on the Web!
>
> It does not work! The parser and the example write "El mßs famoso y caro
> vino de pago de Espa?a". It should be "El más famoso y caro vino de pago
> de España". See the character á and the character ñ (\u00e1 and
> \u00f1).
>
> Regards
> Juan J. Samper
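
The idea raised in the quoted message, honouring the charset named on the "Content-Type" field and falling back to iso-8859-1 as RFC 2616 section 3.7.1 suggests, could be sketched roughly as below. This is only an illustration: the class CharsetSketch and its methods charsetFromContentType() and decode() are hypothetical names, not part of HTMLParser, and the sketch decodes an already-buffered byte array rather than dealing with the partially consumed InputStream problem mentioned above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of HTMLParser.
final class CharsetSketch
{
    /**
     * Returns the charset named in a Content-Type value such as
     * "text/html; charset=iso-8859-1". Falls back to ISO-8859-1 when
     * the parameter is missing, malformed, or not supported by the JVM.
     */
    static Charset charsetFromContentType (String contentType)
    {
        if (contentType != null)
        {
            for (String part : contentType.split (";"))
            {
                String p = part.trim ();
                if (p.toLowerCase (Locale.ROOT).startsWith ("charset="))
                {
                    String name = p.substring ("charset=".length ()).trim ().replace ("\"", "");
                    try
                    {
                        return Charset.forName (name);
                    }
                    catch (IllegalArgumentException e)
                    {
                        // Illegal or unsupported charset name: use the default below.
                        break;
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    /** Decodes raw page bytes using the charset taken from the Content-Type value. */
    static String decode (byte[] bytes, String contentType)
    {
        return new String (bytes, charsetFromContentType (contentType));
    }

    public static void main (String[] args)
    {
        // "España" as iso-8859-1 bytes: the ñ is the single byte 0xF1.
        byte[] raw = { 'E', 's', 'p', 'a', (byte) 0xF1, 'a' };
        String text = decode (raw, "text/html; charset=iso-8859-1");
        // Compare against the escaped form so the check does not depend
        // on how the console renders non-ASCII glyphs.
        System.out.println (text.equals ("Espa\u00f1a"));
    }
}
```

Comparing against the \u escapes, rather than printing the decoded text, keeps the decoding check separate from the question of whether the console can display the glyphs, which is what the Test class in the quoted message probes.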