[Htmlparser-user] latin1->utf8 problem?
From: Eugeny N D. <bo...@re...> - 2006-07-28 21:30:11
Hello,

I'm trying to parse the page http://www.vu.lt/lt/naujienos/337/ but HtmlParser fails with this error:

ERROR org.htmlparser.util.EncodingChangeException: character mismatch (new: ? [0x2013] != old: [0xe2?]) for encoding change from ISO-8859-1 to UTF-8 at character offset 218
[junit] org.htmlparser.util.EncodingChangeException: character mismatch (new: ? [0x2013] != old: [0xe2?]) for encoding change from ISO-8859-1 to UTF-8 at character offset 218
[junit]     at org.htmlparser.lexer.InputStreamSource.setEncoding(InputStreamSource.java:280)
[junit]     at org.htmlparser.lexer.Page.setEncoding(Page.java:865)
[junit]     at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:150)
[junit]     at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:69)
[junit]     at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:160)
[junit]     at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
[junit]     at org.htmlparser.Parser.extractAllNodesThatMatch(Parser.java:768)

The exception is thrown at the marked line:

    Lexer lexer = new Lexer(new Page(document, encoding));
    Parser parser = new Parser(lexer);
    ---->NodeList list = parser.extractAllNodesThatMatch(new InterestedTagsFilter());<----

I don't know the document's encoding in advance, so the encoding argument is null.

Could somebody please advise?

-- 
Eugene N Dzhurinsky
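
P.S. The two code points in the message (0xe2 and 0x2013) look consistent with the same bytes being decoded twice under different charsets: U+2013 (en dash) is the three bytes E2 80 93 in UTF-8, and the first of those bytes read as ISO-8859-1 is U+00E2. The sketch below uses only the JDK, not HtmlParser, and the class and method names are made up for illustration. It assumes the page really contains an en dash near offset 218, which I have not verified.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // UTF-8 encoding of U+2013 (EN DASH): the three bytes E2 80 93.
    // Assumption: bytes like these appear in the page before the META tag.
    static final byte[] EN_DASH_UTF8 = {(byte) 0xE2, (byte) 0x80, (byte) 0x93};

    // Decode the bytes with the given charset and return the first character.
    static char firstChar(byte[] bytes, Charset charset) {
        return new String(bytes, charset).charAt(0);
    }

    public static void main(String[] args) {
        char asLatin1 = firstChar(EN_DASH_UTF8, StandardCharsets.ISO_8859_1);
        char asUtf8 = firstChar(EN_DASH_UTF8, StandardCharsets.UTF_8);
        // The same bytes give different characters, matching the
        // "old" and "new" values reported in the exception message.
        System.out.printf("ISO-8859-1: 0x%x%n", (int) asLatin1); // 0xe2
        System.out.printf("UTF-8:      0x%x%n", (int) asUtf8);   // 0x2013
    }
}
```

If that reading is right, the characters already consumed under the default ISO-8859-1 decoding no longer match once the META tag switches the page to UTF-8, which would explain why the parser reports a mismatch instead of continuing.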