HTML Parser / Bugs / #264 Charset windows-1252 problem

#264 Charset windows-1252 problem

Milestone: v1.6

Status: open

Owner: Derrick Oswald

Labels: Charset Encoding (24)

Priority: 5

Updated: 2008-11-19

Created: 2008-11-19

Creator: Eivind Roennevik

Private: No

When I try to parse a page with charset "windows-1252", if the page contains some special characters, I get different results depending on the constructor I use to create the Parser.

The pages attached in the example contain the characters "left double quote" (“) or "right double quote" (”).

If I read the page into a String and create a Parser using the constructor "Parser(String resource)", the special characters are interpreted correctly, and the string returned by parsing (the title of the page in this case) is correct. If I instead use the constructor that takes a URLConnection, or create a Lexer from an InputStream, the returned title contains characters like â€œ instead of the left double quote.
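For reference, the sequence â€œ is exactly what appears when the UTF-8 bytes of a left double quote (E2 80 9C) are decoded as windows-1252. The sketch below reproduces that with the JDK alone; it does not use the parser and is not a claim about what the library does internally. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode text with one charset, then decode the resulting bytes with another.
    static String recode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    // Render a string as code points so the console encoding does not matter.
    static String codePoints(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) out.append(' ');
            out.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String leftQuote = "\u201C"; // left double quotation mark

        // Same charset on both sides: the quote survives.
        System.out.println(codePoints(recode(leftQuote, cp1252, cp1252)));
        // prints: U+201C

        // UTF-8 bytes read as windows-1252: three characters, shown as â€œ.
        System.out.println(codePoints(recode(leftQuote, StandardCharsets.UTF_8, cp1252)));
        // prints: U+00E2 U+20AC U+0153
    }
}
```

If the title comes back as â€œ, it may be worth checking whether the bytes of the attached file are really windows-1252 (a single byte 0x93 for the left quote) or UTF-8.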

It seems to me that it boils down to StringSource versus InputStreamSource. Any constructor that uses StringSource internally parses the page correctly; any constructor that uses InputStreamSource parses it incorrectly.
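Since the String-based path works, one possible workaround is to decode the stream yourself with an explicit charset and only then hand the resulting String to the parser. The sketch below shows just the decoding step with plain JDK classes; `DecodeFirst` and `readAll` are hypothetical names, and the bytes in `main` stand in for a windows-1252 page.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

class DecodeFirst {
    // Read the whole stream and decode it with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // 0x93 and 0x94 are the windows-1252 bytes for the left and right double quotes.
        byte[] page = {(byte) 0x93, 'T', 'i', 't', 'l', 'e', (byte) 0x94};
        String html = readAll(new ByteArrayInputStream(page), Charset.forName("windows-1252"));

        // The decoded String now starts with U+201C and ends with U+201D.
        System.out.println(html.charAt(0) == '\u201C' && html.charAt(html.length() - 1) == '\u201D');
        // prints: true
    }
}
```

The decoded String could then be passed to whichever String-based constructor or factory already gives correct results.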

Also, if I set the charset in the HTML to utf-8, all parsers work, both those that use StringSource and those that use InputStreamSource.

I have attached a zip file containing a JUnit test case and two simple .htm files, one with the windows-1252 charset and the other with utf-8.

Discussion

Eivind Roennevik - 2008-11-19

Parser Test Case

ParserTestCase.zip

Govind Avireddi - 2011-08-29

I could parse an HTML document with charset=windows-1252. The problem with an unknown charset seems to come down to which encodings java.io.InputStreamReader supports. I found information about supported encodings at http://download.oracle.com/javase/1.4.2/docs/guide/intl/encoding.doc.html
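A quick way to check whether a given JVM knows a charset name before blaming the parser is sketched below. It uses only `java.nio.charset.Charset` and `java.io.InputStreamReader`; the class name `CharsetCheck` and the helper `canDecode` are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class CharsetCheck {
    // True if this JVM can resolve the charset name; false for unknown or malformed names.
    static boolean canDecode(String charsetName) {
        try {
            return Charset.isSupported(charsetName);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("windows-1252 supported: " + canDecode("windows-1252"));

        // InputStreamReader throws UnsupportedEncodingException for names the JVM does not know.
        byte[] data = {(byte) 0x93}; // left double quote in windows-1252
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), "windows-1252")) {
            int c = reader.read();
            System.out.println(String.format("U+%04X", c));
        }
    }
}
```

If `canDecode` returns false for the charset declared in the page, the failure is in the JVM's charset support rather than in the parsing code.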
