#78 page with charset set to utf-16 not decoded correctly

General
closed-rejected
nobody
None
5
2014-08-18
2014-06-27
No

http://www.sonos.com/hometheater is not being decoded. It is quite possibly because the charset is incorrectly set:

<meta http-equiv="Content-Type" content="text/html" charset="utf-16" />

But this seems to be a UTF-8 page.

The browser decodes the page fine, however. Is it possible for jericho to detect that the specified encoding is incorrect and try utf-8?


Discussion

  • Martin Jericho
    2014-06-28

    If passing the URL directly to the Source constructor the document is parsed correctly, as the server sends the correct encoding in the Content-Type header.

    Try running the following console sample application to verify:

    Encoding.bat http://www.sonos.com/hometheater

    So apparently you are using a different mechanism to load the document, then passing the raw input stream to the Source constructor. When you're loading the content you could save the encoding specified by the web server and pass it on to the Source constructor.
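A minimal sketch of the "save the encoding specified by the web server" step, using only the Java standard library. The class and method names (`HeaderCharset`, `fromContentType`) are hypothetical and not part of Jericho; the fallback charset is also an assumption. The idea is to record this charset when the page is downloaded, then later open the saved file with an `InputStreamReader` built on that charset and hand the reader to the `Source` constructor.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class HeaderCharset {
    /**
     * Extracts the charset parameter from an HTTP Content-Type header value
     * such as "text/html; charset=UTF-8". Returns the fallback when the header
     * is null, carries no charset parameter, or names an unsupported charset.
     */
    public static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional double quotes around the value.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Hypothetical header value, as a server might send it.
        Charset cs = fromContentType("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1);
        System.out.println(cs); // prints "UTF-8"
    }
}
```

With `java.net.URLConnection`, the header value would come from `getContentType()`; how the charset is stored next to the saved file is left open here.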

    It would be possible for the parser to recognise that the document-specified encoding is not compatible with the preliminary encoding and reject it, but in this case it would still not be able to detect the correct UTF-8 encoding. There are dedicated libraries for encoding detection if you need something more accurate. The main purpose of this library is to parse HTML, not detect encoding, although it does a pretty good job of the latter. It looks like you don't need an external library, though, if you just use the encoding specified by the web server.
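For cases where the server's encoding was not saved, one rough fallback is to test whether the raw bytes are well-formed UTF-8 before trusting a declared `utf-16`. The sketch below uses the JDK's strict `CharsetDecoder`; it is not something Jericho does, and the class name `Utf8Check` is hypothetical. Passing the check is strong evidence but not proof: pure-ASCII bytes, for example, are valid in many encodings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /** Returns true if the bytes decode as UTF-8 with no malformed sequences. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                // REPORT makes the decoder throw instead of substituting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(isValidUtf8(utf8));  // prints "true"
        System.out.println(isValidUtf8(utf16)); // prints "false"
    }
}
```

If the bytes pass, decoding them as UTF-8 and ignoring the `<meta>` declaration is a reasonable guess; a dedicated detection library remains the more accurate option.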

    Cheers
    Martin

     
    • Thanks. I first save the HTML and call Jericho with the file. I will try
      out your suggestions.

      Status: unread
      Group: General
      Created: Fri Jun 27, 2014 09:29 PM UTC by thushara wijeratna
      Last Updated: Fri Jun 27, 2014 09:29 PM UTC
      Owner: nobody

  • Martin Jericho
    2014-06-28

    • status: unread --> closed-rejected
     
    • Hi Martin,
      If I use a diff lib to detect the encoding and use a Reader with that
      encoding, then pass it to Source, won't Source try to use the encoding
      specified for the Reader?
      Thx,
      Thushara
  • Martin Jericho
    2014-08-09

    No, there is nothing for the parser to decode if you pass it a Reader.
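To illustrate why: a `Reader` delivers characters that have already been decoded, so a consumer only ever sees text, and a stray `charset="utf-16"` in the markup is just part of that text. The sketch below uses only the JDK; the class name `ReaderDecoding` and the sample markup are hypothetical. The same kind of reader, built over a file or network stream instead of a byte array, is what would be passed to the `Source` constructor.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderDecoding {
    /** Decodes raw bytes through a Reader that uses the caller's chosen charset. */
    public static String decode(byte[] raw, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(raw), charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A page that declares utf-16 in its markup but is really UTF-8 bytes.
        String html = "<meta charset=\"utf-16\"><p>caf\u00e9</p>";
        byte[] raw = html.getBytes(StandardCharsets.UTF_8);

        // The charset given to the Reader decides the decoding, not the meta tag.
        System.out.println(decode(raw, StandardCharsets.UTF_8).equals(html)); // prints "true"
    }
}
```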