#77 Malformed HTML not parsing correctly

General
closed-rejected
None
5
2014-05-28
2014-05-23
L P
No

I have the following HTML:

<img src="http://www.union.wisc.edu/emails/2014/images/wud_e_Revelry_14_0726.jpg"" alt="Revel!" width="1056" height="738" border="1"http://www.union.wisc.edu/emails/2014/images/hoofer_f_CouncilMeeting_14_0533.jpg border-color="CCCCCC"/>

But the parser does not recognize this tag as a StartTag and skips it. Here is the Jericho log:

2014-05-23 11:46:49 jericho [ERROR] StartTag img at (p0) has missing whitespace after quoted attribute value at position (p81)
2014-05-23 11:46:49 jericho [ERROR] StartTag img at (p0) contains attribute name with invalid first character at position (p81)
2014-05-23 11:46:49 jericho [ERROR] StartTag img at (p0) has missing whitespace after quoted attribute value at position (p83)
2014-05-23 11:46:49 jericho [ERROR] StartTag img at (p0) has missing whitespace after quoted attribute value at position (p132)
2014-05-23 11:46:49 jericho [ERROR] StartTag img at (p0) rejected because it contains too many errors
2014-05-23 11:46:49 jericho [ERROR] Encountered possible StartTag at (p0) whose content does not match a registered StartTagType

The problem is that the browser (Firefox) still parses and renders the image, so I need Jericho to handle the tag too: I am trying to pick off the src attribute and rewrite the URL.
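
For comparison, here is a minimal, parser-free sketch of the rewrite itself (take the first quoted src value in an img tag and replace the URL) using only java.util.regex. It is not Jericho's API and not how Jericho parses tags. The class name LenientImgSrcRewriter, the rewrite method and the proxy.example URL are hypothetical names chosen for illustration. It assumes a quoted src value and can be fooled by a ">" or a literal " src=" inside an earlier attribute value.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jericho: a lenient regex-based rewrite of img src values.
final class LenientImgSrcRewriter {
    // Group 1: tag text up to and including the opening quote of src.
    // Group 2: the quote character. Group 3: the URL between the quotes.
    private static final Pattern IMG_SRC = Pattern.compile(
        "(<img\\b[^>]*?\\ssrc\\s*=\\s*([\"']))(.*?)\\2",
        Pattern.CASE_INSENSITIVE);

    // Applies mapper to the src URL of every matching img tag; all other text is copied unchanged.
    static String rewrite(String html, UnaryOperator<String> mapper) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + mapper.apply(m.group(3)) + m.group(2);
            // quoteReplacement keeps '$' and '\' in URLs from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Shortened version of the tag from this ticket, including the stray double quote.
        String html = "<img src=\"http://www.union.wisc.edu/emails/2014/images/"
            + "wud_e_Revelry_14_0726.jpg\"\" alt=\"Revel!\" width=\"1056\" height=\"738\"/>";
        // The proxy URL is a made-up example of a rewrite rule.
        System.out.println(rewrite(html, url -> "https://proxy.example/fetch?u=" + url));
    }
}
```

Because the pattern only looks for whitespace followed by src=, the stray quote and the unquoted text later in the tag do not matter to it. That leniency is also its weakness compared with a real parser, so it is best treated as a fallback for tags the parser rejects.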

Discussion

  • L P - 2014-05-24

    Another similar test case:

    <img class="newsletterLogo" style="border:0; width="304" height="26"
    height:auto; line-height:100%; outline:none; text-decoration:none;"
    src="http://images.military.com/media/mail/news/insider-logo-navy.jpg"/>

     
  • L P - 2014-05-28

    OK, I think I have found a solution. If I call Attributes.setDefaultMaxErrorCount() with a value higher than the default of 2, parsing continues and I am able to get the attributes.

     
  • Martin Jericho - 2014-05-28

    Sorry for not responding; email notifications weren't working for a few days. I'm glad you found the solution!

     
  • Martin Jericho - 2014-05-28
    • status: unread --> closed-rejected
     
