For this URL:
http://www.rainbowappliance.com/KGSK901SSS.html?utm_medium=shoppingengine
Web-Harvest returns http,charset=iso-8859-1, but the page contains the following meta tag:
<meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
It looks like the regexp defined in HttpProcessor:
private static final String HTML_META_CHARSET_REGEX =
"(<meta\\s*http-equiv\\s*=\\s*(\\"|')content-type(\\"|')\\s*content\\s*=\\s*(\\"|')text html;\\s*charset\\s*="\\s*(.*?)(\\"|')/?">)";
does not allow for whitespace after the last quote symbol in the meta tag.
Pretty strange, but it still does not work.
It looks like a pretty simple change.
Instead of
(<meta\\s*http-equiv\\s*=\\s*(\\"|')content-type(\\"|')\\s*content\\s*=\\s*(\\"|')text html;\\s*charset\\s*="\\s*(.*?)(\\"|')/?">)
it should be
(<meta\\s*http-equiv\\s*=\\s*(\\"|')content-type(\\"|')\\s*content\\s*=\\s*(\\"|')text html;\\s*charset\\s*="\\s*(.*?)(\\"|')\\s*/?">)
i.e. add \s* after the last quote symbol.
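To illustrate the idea, here is a minimal sketch comparing a pattern without and with the extra \s*. The two patterns are simplified stand-ins written for this example, not the exact constant from HttpProcessor, and the class and method names (MetaCharsetDemo, findCharset) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetDemo {
    // Assumed, simplified pattern: no whitespace allowed between the last quote and "/>".
    static final String STRICT =
        "<meta\\s*http-equiv\\s*=\\s*(\"|')content-type(\"|')\\s*content\\s*=\\s*"
        + "(\"|')text/html;\\s*charset\\s*=\\s*(.*?)(\"|')/?>";

    // Same pattern with \s* added after the last quote symbol.
    static final String LENIENT =
        "<meta\\s*http-equiv\\s*=\\s*(\"|')content-type(\"|')\\s*content\\s*=\\s*"
        + "(\"|')text/html;\\s*charset\\s*=\\s*(.*?)(\"|')\\s*/?>";

    // Returns the charset captured by group 4, or null if the pattern does not match.
    static String findCharset(String regex, String html) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html);
        return m.find() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        String tight  = "<meta http-equiv=\"content-type\" content=\"text/html; charset=UTF-8\"/>";
        String spaced = "<meta http-equiv=\"content-type\" content=\"text/html; charset=UTF-8\" />";

        System.out.println(findCharset(STRICT, tight));    // UTF-8
        System.out.println(findCharset(STRICT, spaced));   // null
        System.out.println(findCharset(LENIENT, tight));   // UTF-8
        System.out.println(findCharset(LENIENT, spaced));  // UTF-8
    }
}
```

With these assumed patterns, only the lenient one finds the charset when a space precedes the closing "/>", so a fallback to the default charset would be the expected symptom of the strict form.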
Could somebody please fix it?