Re: [Htmlparser-user] Encoding problem when parsing html
From: MitchH <m2...@mi...> - 2007-10-12 14:17:52
Martin Sturm <msturm10 <at> gmail.com> writes:

> Hello,
>
> I'm using HTMLParser for extracting text from a HTML page in order to
> index it using a full text search engine.
> During the testing phase, I discovered that some web pages are not
> parsed correctly by HTMLParser. One of these webpages is for example
> http://www.microsoft.com.
> I think the problem is that according to the HTTP headers, the
> encoding is in UTF-8, but in HTML META tags this is changed to UTF-16.
> This can be handled by catching the EncodingChangeException, but this
> doesn't prevent the textual content of the site interpreted
> incorrectly.

The Microsoft site contains the following snippet:

<head><META http-equiv="Content-Type" content="text/html; charset=utf-16">
<title>Microsoft Corporation</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

It seems they change the content encoding just for the title (god knows why).
The second change, back to UTF-8, is what causes things to fall over.

I found a fix at
http://osdir.com/ml/parsers.htmlparser.user/2006-03/msg00033.html
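
To make the conflict concrete, here is a rough standalone Java sketch. It is not the fix from the linked post and it does not use HTMLParser's API at all. It only lists the charsets declared in META tags, in order, and applies one possible policy: if the META tags disagree with each other, fall back to the charset from the HTTP header. The class name MetaCharsetSketch, both method names, and the policy itself are my own assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser.
final class MetaCharsetSketch {

    // Loose match for <meta ... charset=NAME ...>; a regex is only an
    // approximation here and will miss unusual markup.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns every charset named in a META tag, lowercased, in document order.
    static List<String> declaredCharsets(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = META_CHARSET.matcher(html);
        while (m.find()) {
            found.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return found;
    }

    // Assumed policy: use the META charset only when all META tags agree;
    // otherwise use the HTTP header charset, or the fallback if there is none.
    static String chooseCharset(String httpCharset, List<String> metaCharsets,
                                String fallback) {
        Set<String> distinct = new LinkedHashSet<>(metaCharsets);
        if (distinct.size() == 1) {
            return distinct.iterator().next();
        }
        if (httpCharset != null) {
            return httpCharset.toLowerCase(Locale.ROOT);
        }
        return fallback;
    }

    public static void main(String[] args) {
        String html = "<head><META http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-16\">"
                + "<title>Microsoft Corporation</title>"
                + "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\">";
        List<String> metas = declaredCharsets(html);
        System.out.println("META charsets: " + metas);
        System.out.println("Chosen: " + chooseCharset("UTF-8", metas, "iso-8859-1"));
    }
}
```

The chosen name could then be handed to the parser before parsing, so that a mid-document change does not matter. Whether that is the right policy for every page is a judgment call; the HTTP header can be wrong too.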