#76 Double quoted text in attribute value causes parsing errors

Milestone: General
Status: closed-rejected
Owner: nobody
Labels: None
Priority: 5
Updated: 2013-12-12
Created: 2013-12-11
Creator: Souad
Private: No

Hello,

I have a problem parsing this HTML:
<html><a title="some text "some quoted text"" target="_blank"> some link </a></html>

I think the problem is that the quoted text in the title attribute value is interpreted as an end quote, so the next attribute, target, is misinterpreted, and I end up with these errors:

ERROR: net.htmlparser.jericho - StartTag a at (r1,c13,p12) has missing whitespace after quoted attribute value at position (r1,c34,p33)
ERROR: net.htmlparser.jericho - StartTag a at (r1,c13,p12) has missing whitespace after quoted attribute value at position (r1,c39,p38)
ERROR: net.htmlparser.jericho - StartTag a at (r1,c13,p12) has missing whitespace after quoted attribute value at position (r1,c46,p45)
ERROR: net.htmlparser.jericho - StartTag a at (r1,c13,p12) contains attribute name with invalid character at position (r1,c50,p49)
ERROR: net.htmlparser.jericho - StartTag a at (r1,c13,p12) contains attribute name with invalid character at position (r1,c51,p50)
ERROR: net.htmlparser.jericho - StartTag a at (r1,c13,p12) has missing whitespace after quoted attribute value at position (r1,c53,p52)
ERROR: net.htmlparser.jericho - StartTag a at (r1,c13,p12) rejected because it contains too many errors
ERROR: net.htmlparser.jericho - Encountered possible StartTag at (r1,c13,p12) whose content does not match a registered StartTagType

In my program, I just do:
final Source source = new Source(html);
source.fullSequentialParse();

The errors occur right after calling source.fullSequentialParse();

Is it a bug, or is there a specific method to call for this case?

Thanks

Discussion

  • Martin Jericho

    Martin Jericho - 2013-12-11

    Hi Souad,

    This is not a bug, just the parser properly reporting the syntax errors in the HTML. Quotes inside attribute values must be converted to character references such as &quot;.

    By default the parser is configured to give up trying to parse a tag if there are more than two minor syntactical errors. You can make it more tolerant by setting the static configuration property Attributes.setDefaultMaxErrorCount(5) or some higher number.

    Increasing the tolerance in this way will ensure the tag is still recognised by the parser, although it will only be able to guess at what the attributes were meant to be. Browsers will also be guessing, and neither will properly interpret embedded quotes in attribute values.

    Cheers
    Martin
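
As a rough illustration of the escaping Martin describes, the sketch below converts embedded double quotes (and a few other markup-significant characters) to character references before the attribute value is placed into the HTML. The class AttributeEscaper and the method escapeAttributeValue are hypothetical names invented for this example and are not part of Jericho; the code uses only the Java standard library.

```java
public class AttributeEscaper {

    // Hypothetical helper (not part of Jericho): replaces characters that
    // would terminate or corrupt a double-quoted attribute value with
    // character references.
    public static String escapeAttributeValue(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '"': sb.append("&quot;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The title text from the report, with its embedded quotes.
        String title = "some text \"some quoted text\"";

        // Build the markup with the escaped value; the result can then be
        // handed to the parser (e.g. new Source(html)) as in the report.
        String html = "<html><a title=\"" + escapeAttributeValue(title)
                + "\" target=\"_blank\"> some link </a></html>";

        System.out.println(html);
    }
}
```

This only helps when the HTML is generated by code you control. For markup received from elsewhere, raising the error tolerance as described above is the option mentioned in the thread.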

     
  • Martin Jericho

    Martin Jericho - 2013-12-11
    • status: unread --> closed-rejected
     
  • Souad

    Souad - 2013-12-12

    Thanks Martin, it worked!
    By the way, I just found a discussion in the forum about the same problem: http://sourceforge.net/p/jerichohtml/discussion/350024/thread/3388b6e3/
    I hadn't seen it before (and neither had Google!), so sorry for opening this false bug. I will look more closely at the forum next time :)

    Cheers

     
