#18 HttpProcessor does not recognize charset in some cases

Milestone: Backlog
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2012-11-15
Created: 2009-07-20
Creator: Andrey
Private: No

For this URL:

http://www.rainbowappliance.com/KGSK901SSS.html?utm_medium=shoppingengine

web-harvest returns http,charset=iso-8859-1, but the page contains the following meta tag:

<meta http-equiv="content-type" content="text/html; charset=UTF-8"/>

It looks like the regexp defined in HttpProcessor:

private static final String HTML_META_CHARSET_REGEX =
"(<meta\\s*http-equiv\\s*=\\s*(\\"|')content-type(\\"|')\\s*content\\s*=\\s*(\\"|')text html;\\s*charset\\s*="\\s*(.*?)(\\"|')/?">)";

does not allow a space character after the last quote symbol in the meta tag.
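
To illustrate the behaviour described above, here is a minimal, self-contained sketch. It does not use the actual HttpProcessor constant: the class MetaCharsetSketch, the patterns STRICT and LENIENT, and the helper findCharset are hypothetical names, and the pattern is a simplified stand-in that assumes the content value has the form "text/html; charset=...". The only difference between the two patterns is the extra \s* after the closing quote.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSketch {

    // Hypothetical, simplified pattern (NOT the real HttpProcessor regexp).
    // Matches everything up to and including the quote that closes the content attribute.
    private static final String UP_TO_CLOSING_QUOTE =
        "<meta\\s+http-equiv\\s*=\\s*[\"']content-type[\"']"
        + "\\s+content\\s*=\\s*[\"']text/html;\\s*charset\\s*=\\s*([^\"']*?)\\s*[\"']";

    // No whitespace allowed between the closing quote and "/>" or ">".
    public static final Pattern STRICT =
        Pattern.compile(UP_TO_CLOSING_QUOTE + "/?>", Pattern.CASE_INSENSITIVE);

    // Optional whitespace allowed after the closing quote.
    public static final Pattern LENIENT =
        Pattern.compile(UP_TO_CLOSING_QUOTE + "\\s*/?>", Pattern.CASE_INSENSITIVE);

    // Returns the charset captured by the pattern, or null if the tag is not matched.
    public static String findCharset(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tight  = "<meta http-equiv=\"content-type\" content=\"text/html; charset=UTF-8\"/>";
        String spaced = "<meta http-equiv=\"content-type\" content=\"text/html; charset=UTF-8\" />";

        System.out.println(findCharset(STRICT, tight));   // UTF-8
        System.out.println(findCharset(STRICT, spaced));  // null
        System.out.println(findCharset(LENIENT, spaced)); // UTF-8
    }
}
```

With the strict pattern, a tag that has a space before "/>" is not matched at all. In that case the caller would presumably fall back to a default charset, which would be consistent with the iso-8859-1 result reported here.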

Discussion

  • Anonymous - 2010-08-27

    Pretty strange, but it still does not work.
    It looks like it's a pretty simple change

    Instead of
    (<meta\\s*http-equiv\\s*=\\s*(\\"|')content-type(\\"|')\\s*content\\s*=\\s*(\\"|')text html;\\s*charset\\s*="\\s*(.*?)(\\"|')/?">)

    should be
    (<meta\\s*http-equiv\\s*=\\s*("|')content-type("|')\\s*content\\s*=\\s*("|')text html;\\s*charset\\s*="\\s*(.*?)("|')\\s*/?">)

    i.e. add \s* after the last quote symbol.

    Could somebody please fix it?

     
  • Piotr Dyraga - 2012-11-15
    • milestone: --> Backlog
     
