#274 character mismatch

Status: open-wont-fix
Priority: 5
Updated: 2009-11-26
Created: 2009-11-24
Private: No

resuming read
read
read
read
org.htmlparser.util.EncodingChangeException: character mismatch (new: ? [0xfffd] != old: [0x97?]) for encoding change from ISO-8859-1 to UTF-8 at character offset 16467
at org.htmlparser.lexer.InputStreamSource.setEncoding(InputStreamSource.java:279)
at org.htmlparser.lexer.Page.setEncoding(Page.java:864)
elements read:3
printing:

printing:<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
printing:

at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:149)
at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:68)
at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:159)
at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:91)
at tester.test.main(test.java:61)
NYT elements read:0

Discussion

  • Thornton Martin

    Thornton Martin - 2009-11-24

    Test case showing bug

     
  • Derrick Oswald

    Derrick Oswald - 2009-11-24

    This looks outwardly like the standard use case of a page whose encoding declared in the META tag differs from the one in the HTTP response header; see the FAQ:
    http://htmlparser.sourceforge.net/faq.html#encodingchangeexception

    If this isn't the case, please reopen the defect.
    Moving to Pending state.
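
To make the message in the report concrete, here is a minimal sketch using only the Java standard library. It does not use or reproduce htmlparser's own code; the class name `EncodingMismatchDemo`, the helper names, and the four-byte sample buffer are assumptions made for illustration. It decodes the same bytes once as ISO-8859-1 and once as UTF-8 and reports the first character offset where the two results differ, which is the kind of disagreement the exception text describes for byte 0x97.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    /** Decodes raw bytes with the given charset; malformed input is replaced by U+FFFD. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    /** Returns the first character offset where the two decodings differ, or -1 if they agree. */
    static int firstMismatch(byte[] bytes, Charset oldCharset, Charset newCharset) {
        String before = decode(bytes, oldCharset);
        String after = decode(bytes, newCharset);
        int shared = Math.min(before.length(), after.length());
        for (int i = 0; i < shared; i++) {
            if (before.charAt(i) != after.charAt(i)) {
                return i;
            }
        }
        // Same prefix: either fully equal, or one decoding is longer than the other.
        return before.length() == after.length() ? -1 : shared;
    }

    public static void main(String[] args) {
        // Hypothetical page fragment: ASCII text with one stray 0x97 byte in it.
        byte[] page = {'a', 'b', (byte) 0x97, 'c'};

        String oldText = decode(page, StandardCharsets.ISO_8859_1);
        String newText = decode(page, StandardCharsets.UTF_8);
        int offset = firstMismatch(page, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        // Prints: old: 0x97, new: 0xfffd, first mismatch at offset 2
        System.out.printf("old: 0x%x, new: 0x%x, first mismatch at offset %d%n",
                (int) oldText.charAt(2), (int) newText.charAt(2), offset);
    }
}
```

A lone 0x97 byte is not valid UTF-8, so the UTF-8 decoder substitutes U+FFFD, while ISO-8859-1 maps it to the control character U+0097. In windows-1252 the same byte is an em dash, which may be what such a page actually intended, though that is a guess about the page rather than something the ticket confirms.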

     
  • Derrick Oswald

    Derrick Oswald - 2009-11-24
    • status: open --> pending-wont-fix
     
  • Thornton Martin

    Thornton Martin - 2009-11-26

    This NY Times page also causes trouble with your parser program:

    C:\TEMP\htmlparser1_6_20060610\htmlparser1_6\bin>.\parser.cmd http://www.nytimes.com/2007/10/25/technology/circuits/25basics.html?ex=1350964800&en=f86fab94086eb3bf&ei=5088&partner=rssnyt&emc=rss

    C:\TEMP\htmlparser1_6_20060610\htmlparser1_6\bin>D:\Sun\SDK\jdk\jre\bin\java.exe -classpath "C:\TEMP\htmlparser1_6_20060610\htmlparser1_6\lib\htmlparser.jar" org.htmlparser.Parser http://www.nytimes.com/2007/10/25/technology/circuits/25basics.html?ex 1350964800
    org.htmlparser.util.EncodingChangeException: character mismatch (new: ? [0xfffd] != old: [0x97?]) for encoding change from ISO-8859-1 to UTF-8 at character offset 18187
    at org.htmlparser.lexer.InputStreamSource.setEncoding(InputStreamSource.java:280)
    at org.htmlparser.lexer.Page.setEncoding(Page.java:865)
    at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:150)
    at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:69)
    at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:160)
    at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
    at org.htmlparser.Parser.parse(Parser.java:701)
    at org.htmlparser.Parser.main(Parser.java:849)
    'en' is not recognized as an internal or external command,
    operable program or batch file.
    'ei' is not recognized as an internal or external command,
    operable program or batch file.
    'partner' is not recognized as an internal or external command,
    operable program or batch file.
    'emc' is not recognized as an internal or external command,
    operable program or batch file.

     
  • Thornton Martin

    Thornton Martin - 2009-11-26
    • status: pending-wont-fix --> open-wont-fix
     