Thread: Re: [Htmlparser-developer] How parse HTML in spanish?
From: agente007 <e-a...@ex...> - 2002-12-13 22:20:10
--- On Fri 12/13, Derrick Oswald wrote:
> Can you be more specific about what isn't being extracted correctly?
> The best way would be to make a test case that shows the problem and
> submit it as a bug.

For example, when I try a URL such as "http://www.elmundo.es", the following text appears:

... Text = El mßs famoso y caro vino de pago de Espa?a, el Pingus, no podrß acceder a una denominaci?n de origen propia ...

The correct text would be:

... Text = El más famoso y caro vino de pago de España, el Pingus, no podrá acceder a una denominación de origen propia ...

What happened?

Juan J
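Garbled accents like these are typical of bytes being decoded with a charset other than the one the page was written in. The following is a minimal sketch of that effect, not HTMLParser code: the class name `CharsetMismatchDemo` and the helper `decode` are hypothetical, and US-ASCII stands in for "some wrong charset". ISO-8859-4 is tried only if the running JDK happens to support it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of HTMLParser.
class CharsetMismatchDemo {

    // Decode raw bytes using the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "España" as an ISO-8859-1 server would send it: the ñ is the single byte 0xF1.
        byte[] raw = "Espa\u00f1a".getBytes(StandardCharsets.ISO_8859_1);

        // Matching charset: the text round-trips intact.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));

        // Mismatched charset: 0xF1 is not valid US-ASCII, so the decoder
        // substitutes the replacement character U+FFFD.
        System.out.println(decode(raw, StandardCharsets.US_ASCII));

        // ISO-8859-4 is an optional charset, so check before using it.
        if (Charset.isSupported("ISO-8859-4")) {
            System.out.println(decode(raw, Charset.forName("ISO-8859-4")));
        }
    }
}
```

What actually appears on screen also depends on the console's own encoding, so the printed glyphs may differ from machine to machine even when the decoded string is correct.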
From: agente007 <e-a...@ex...> - 2002-12-14 13:12:48
--- On Fri 12/13, Derrick Oswald wrote:
> Yes, there seems to be a problem.
> [...]
> So I have changed the default to "8859_1".
> [...]

Thanks! I will try it.

Juan J
From: agente007 <e-a...@ex...> - 2002-12-16 12:16:22
--- On Fri 12/13, Derrick Oswald wrote:
> [...]
> So I have changed the default to "8859_1".
>
> I would also make sure that you are able to see the correct glyphs by
> running this:
> [...]

It does not work! Both the parser and the example print:

"El mßs famoso y caro vino de pago de Espa?a"

It should be:

"El más famoso y caro vino de pago de España"

Note the character á and the character ñ (\u00e1 and \u00f1).

Regards,
Juan J. Samper
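When even a string literal with \u escapes prints incorrectly, the parser is probably not the only factor: the charset Java uses for console output may be involved too. The thread does not confirm this, so the following is only a sketch of one possible cause. It shows that a character the output charset cannot represent is written as '?', and that a `PrintStream` can be given an explicit encoding. The class name `ConsoleEncodingDemo` and the helper `asPrinted` are hypothetical, and UTF-8 is an assumed console encoding.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of HTMLParser.
class ConsoleEncodingDemo {

    // Show what a string looks like after being encoded for an output
    // charset: characters the charset cannot represent become '?'.
    static String asPrinted(String text, Charset outputCharset) {
        byte[] encoded = text.getBytes(outputCharset);
        return new String(encoded, outputCharset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "El m\u00e1s famoso y caro vino de pago de Espa\u00f1a";

        // The charset the JVM picked up from the platform.
        System.out.println("default charset: " + Charset.defaultCharset());

        // US-ASCII has no á or ñ, so both are replaced by '?'.
        System.out.println(asPrinted(text, StandardCharsets.US_ASCII));

        // Write with an explicit encoding instead of the platform default.
        // UTF-8 is an assumption; use whatever the console actually expects.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(text);
    }
}
```

If the console interprets bytes in a different encoding than the one Java writes, the glyphs will still be wrong even though the string in memory is correct.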
From: Derrick O. <Der...@ro...> - 2002-12-16 12:51:06
Juan,

I don't see that particular string anymore at http://www.elmundo.es, but another instance is:

El miércoles tendrá lugar el estreno mundial de la segunda entrega de 'El señor de los anillos'.

which has both the \u00e1 and \u00f1 characters printing correctly.

I also tried the jar file directly from the release candidate /htmlparser/htmlparser1_2_20021215.zip and it also prints those characters correctly. You may be using an old jar file.

Derrick
From: Derrick O. <Der...@ro...> - 2002-12-14 04:27:21
Yes, there seems to be a problem.

The openURLConnection() method of HTMLParser uses "8859_4" encoding, which presumably maps to iso-8859-4.

According to RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt?number=2616) the default charset should be iso-8859-1 (section 3.7.1). The content of http://www.elmundo.es is indeed content="text/html; charset=iso-8859-1" and should be interpreted that way.

From what I can find, 8859-4 is an extension of 8859-1 for Lithuanian and Latvian characters, and is superseded by 8859-10. For example, see the Linux man pages for charsets (http://nodevice.com/sections/ManIndex/man0132.html):

    8859-4 (Latin-4)
        Latin-4 introduced letters for Estonian, Latvian, and
        Lithuanian. It is essentially obsolete; see 8859-10 (Latin-6).

So I have changed the default to "8859_1".

I would also make sure that you are able to see the correct glyphs by running this:

    public class Test
    {
        public static void main (String[] args)
        {
            System.out.println ("El m\u00e1s famoso y caro vino de pago de Espa\u00f1a");
        }
    }

Going forward, it would be good for HTMLParser to honour the charset property on the "Content-Type" field in the HTML header. But at that point the InputStream from the URLConnection is already partially consumed by the parser, and a switch of character set may be problematic. It's not clear when the character set is supposed to take effect (within <BODY> </BODY> ?), but this may be a good reason to use mark() and reset() on the Reader.

Derrick
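The mark()/reset() idea can be sketched as follows. This is an illustration under stated assumptions, not the change made in HTMLParser: it marks the byte stream rather than the Reader (bytes are what need re-decoding), sniffs a prefix as iso-8859-1 to look for a charset declaration, then rewinds and decodes with the declared charset. The class `CharsetSniffer`, its methods, the 4096-byte sniff limit, and the regular expression are all hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of HTMLParser.
class CharsetSniffer {

    // Assumed limit: how many leading bytes to inspect for a declaration.
    private static final int SNIFF_LIMIT = 4096;

    // Matches e.g. content="text/html; charset=iso-8859-1".
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    // Extract a supported charset from the text, or return the fallback.
    static Charset charsetFrom(String text, Charset fallback) {
        Matcher m = CHARSET.matcher(text);
        if (m.find()) {
            try {
                if (Charset.isSupported(m.group(1))) {
                    return Charset.forName(m.group(1));
                }
            } catch (IllegalArgumentException e) {
                // Illegal charset name: fall through to the fallback.
            }
        }
        return fallback;
    }

    // Peek at the start of the stream, rewind, and decode from byte zero.
    static Reader open(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw, SNIFF_LIMIT);
        in.mark(SNIFF_LIMIT);
        byte[] head = new byte[SNIFF_LIMIT];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        in.reset(); // rewind so the Reader sees the whole document

        // iso-8859-1 maps every byte to a character, so it is safe for sniffing.
        String prefix = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        return new InputStreamReader(in, charsetFrom(prefix, StandardCharsets.ISO_8859_1));
    }

    // Convenience wrapper: decode a whole in-memory page to a String.
    static String decodePage(byte[] page) {
        try (Reader reader = open(new ByteArrayInputStream(page))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\"></head>"
                + "<body>Espa\u00f1a</body></html>";
        System.out.println(decodePage(html.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Rewinding to the start of the document sidesteps the question of where a charset switch should take effect, at the cost of buffering the sniffed prefix. A declaration that appears after the first 4096 bytes is missed and the iso-8859-1 default applies.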