#264 Charset windows-1252 problem

v1.6
open
5
2008-11-19
2008-11-19
No

When I try to parse a page with charset "windows-1252", if the page contains some special characters, I get different results depending on the constructor I use to create the Parser.

The pages attached in the example contain the characters "left double quote" (“) or "right double quote" (”).

If I read the page into a String and create a Parser using the constructor "Parser(String resource)", the special characters are interpreted correctly, and the string returned by parsing (the title of the page in this case) is correct. If I instead use the constructor that takes a URLConnection, or create a Lexer with an InputStream, the returned title contains characters like “ instead of the left double quote.
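The garbled title is consistent with a mismatch between the bytes of the page and the charset used to decode them. As a minimal sketch using only the JDK (the class and method names are hypothetical and not part of HTML Parser), encoding a left double quote as UTF-8 and then decoding those bytes as windows-1252 yields exactly the three characters â, € and œ. Whether this is what actually happens inside InputStreamSource is an assumption; the sketch only shows one way such output can arise.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use or depend on HTML Parser.
final class MojibakeDemo {
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Encode the text as UTF-8, then decode the resulting bytes as windows-1252.
    public static String utf8BytesReadAs1252(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, WINDOWS_1252);
    }

    // Decode raw windows-1252 bytes with the matching charset.
    public static String decode1252(byte[] bytes) {
        return new String(bytes, WINDOWS_1252);
    }

    public static void main(String[] args) {
        // U+201C is E2 80 9C in UTF-8; read as windows-1252 these bytes
        // become U+00E2, U+20AC, U+0153.
        System.out.println(utf8BytesReadAs1252("\u201C"));
        // 0x93 and 0x94 are the windows-1252 bytes for the two curly quotes.
        System.out.println(decode1252(new byte[] {(byte) 0x93, (byte) 0x94}));
    }
}
```

If the attached windows-1252 page turns out to be saved as UTF-8 on disk, this mismatch alone would explain the difference between the two code paths; checking the raw bytes of the quote characters in the file (0x93/0x94 versus E2 80 9C/9D) would settle it.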

It seems to me that it boils down to StringSource versus InputStreamSource. Any constructor that uses StringSource internally parses the page correctly; when the Parser uses InputStreamSource, the page is parsed incorrectly.

Also, if I set the charset in the HTML to utf-8, all parsers work, both those that use StringSource and those that use InputStreamSource.

I have attached a zip file containing a JUnit test case and two simple .htm files, one with the windows-1252 charset and the other with utf-8.

Discussion

  • Eivind Roennevik

    Parser Test Case

     
  • Govind Avireddi

    Govind Avireddi - 2011-08-29

    I could parse an HTML document with charset=windows-1252. The problem with an unknown charset seems to depend on which encodings java.io.InputStreamReader supports. I found information about supported encodings at http://download.oracle.com/javase/1.4.2/docs/guide/intl/encoding.doc.html
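To check the point about supported encodings on a given JVM, something like the following JDK-only sketch may help. The class and method names are hypothetical; it asks the runtime whether a charset name is supported and decodes a few bytes through an InputStreamReader with an explicit charset, independent of HTML Parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper; not part of HTML Parser.
final class CharsetSupportCheck {
    // True if this JVM can decode the named charset; false for unknown or illegal names.
    public static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal charset names.
            return false;
        }
    }

    // Decode bytes through an InputStreamReader using an explicitly named charset.
    public static String decode(byte[] bytes, String charsetName) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charsetName)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            // Includes UnsupportedEncodingException for an unsupported charset name.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("windows-1252 supported: " + isSupported("windows-1252"));
        // 0x93 and 0x94 are the windows-1252 bytes for the left and right double quotes.
        String text = decode(new byte[] {(byte) 0x93, 'H', 'i', (byte) 0x94}, "windows-1252");
        System.out.println(text.length() + " characters decoded");
    }
}
```

If `isSupported("windows-1252")` returns true and the decoded string starts with U+201C, the JVM itself handles the charset, which would point back at how the stream is read rather than at missing encoding support.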

     
