
Problem with quotation marks and attributes

2013-02-01
  • Mark H. Butler

    Mark H. Butler - 2013-02-01

    Hi Martin!

    First, thanks very much for Jericho; it is a very useful tool.

    Jericho cannot parse broken HTML like this:

    <table style="WIDTH: 733px; height=; color: #666;"380" cellspacing="2" cellpadding="0" width="1019" align="center">
    

    because of the extra quote. I know this is horribly broken HTML, but this type of error is surprisingly common because one tool (I am guessing Microsoft Word) uses &quot; to put quotes inside attributes, e.g.

    <span style="mso-ascii-font-family: &quot;Times New Roman&quot;; mso-bidi-font-family: &quot;Times New Roman&quot;; mso-ansi-language: EN-US; mso-fareast-language: ZH-TW; mso-bidi-language: AR-SA" class="style1">
    

    So if the file is passed through another tool that blindly unescapes &quot;, Jericho can't parse it and ignores large sections of the HTML document.
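To make the failure mode concrete, here is a minimal sketch in plain Java (no Jericho involved) of what a blind unescape does to an attribute value. The class and method names are hypothetical, and `firstQuotedValue` is only a stand-in for a simple reader that takes the text between the first pair of double quotes; it is not how Jericho parses attributes.

```java
final class QuotUnescapeDemo {

    // The naive step described above: every &quot; becomes a literal quote,
    // with no regard for whether it sits inside an attribute value.
    static String blindUnescape(String html) {
        return html.replace("&quot;", "\"");
    }

    // Returns the text between the first pair of double quotes in a tag,
    // which is what a simple reader would take as the first attribute value.
    static String firstQuotedValue(String tag) {
        int open = tag.indexOf('"');
        if (open < 0) {
            return null;
        }
        int close = tag.indexOf('"', open + 1);
        return close < 0 ? null : tag.substring(open + 1, close);
    }

    public static void main(String[] args) {
        String original =
            "<span style=\"mso-ascii-font-family: &quot;Times New Roman&quot;\" class=\"style1\">";
        String damaged = blindUnescape(original);

        // Before unescaping, the whole style value is read intact.
        System.out.println(firstQuotedValue(original));
        // After unescaping, the value is cut off at the first injected quote.
        System.out.println(firstQuotedValue(damaged));
    }
}
```

After the unescape, the style value ends at the first injected quote and the remaining text is left dangling inside the tag, which is the kind of attribute error described here.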

    Is there any way to avoid this? In the example above a numeric character follows the quote, indicating an attribute problem, so the parser could then ignore everything up to the first >, and at least it would recover the table contents.
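The recovery idea can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Jericho's implementation: `quoteAwareEnd` and `lenientEnd` are hypothetical helpers, and the quote-toggling scan is just one simple way a tag end could be located. With the table tag from this post, the stray quote leaves the quote-aware scan "inside" a value when it reaches the real >, while the lenient scan stops there.

```java
final class TagEndScan {

    // Quote-aware scan: a '>' only ends the tag when it is outside double quotes.
    // Returns the index of the closing '>', or -1 if none is found.
    static int quoteAwareEnd(String src, int start) {
        boolean inQuotes = false;
        for (int i = start; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == '>' && !inQuotes) {
                return i;
            }
        }
        return -1;
    }

    // Lenient fallback: take the first '>' regardless of quoting.
    static int lenientEnd(String src, int start) {
        return src.indexOf('>', start);
    }

    public static void main(String[] args) {
        String tag = "<table style=\"WIDTH: 733px; height=; color: #666;\"380\" cellspacing=\"2\""
            + " cellpadding=\"0\" width=\"1019\" align=\"center\">";
        String src = tag + "<tr><td>cell</td></tr></table>";

        // The odd number of quotes means no '>' is ever seen outside quotes.
        System.out.println(quoteAwareEnd(src, 0));

        // The lenient scan ends the tag at its real '>' and keeps the contents.
        int end = lenientEnd(src, 0);
        System.out.println(src.substring(end + 1));
    }
}
```

The trade-off is that a lenient scan would cut a well-formed tag short whenever a legitimately quoted attribute value contains a > character, so it is only reasonable as a fallback once a tag has already been found to be malformed.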

    Thanks,

    Mark

     

    Last edit: Mark H. Butler 2013-02-01
  • Martin Jericho

    Martin Jericho - 2013-02-01

    Hi Mark,

    What is this mystery tool that blindly unescapes character references in your source? Get rid of that and the problem goes away.

    If for some reason you have no control over it, you could increase the maximum number of attribute errors allowed before the parser rejects a tag, using the static configuration method Attributes.setDefaultMaxErrorCount(int).

    Cheers
    Martin

     
  • Mark H. Butler

    Mark H. Butler - 2013-02-01

    Hi Martin,

    I looked at the code in Attributes.construct(), and the problem is that the default error threshold is set quite low. If I increase it, i.e.

    Attributes.setDefaultMaxErrorCount(15);

    then I can successfully parse the content.

    Thanks very much for this great library!

    Best wishes,

    Mark

     
