Hello,
I have a problem parsing this HTML:
<html><a title="some text "some quoted text"" target="_blank"> some link </a></html>
I think the problem is that the quoted text inside the title attribute value is interpreted as the closing quote, so the next attribute, target, is misinterpreted, and I end up with these errors:
ERROR: net.htmlparser.jericho - StartTag a at (r1,c13,p12) has missing whitespace after quoted attribute value at position (r1,c34,p33)
ERROR: net.htmlparser.jericho - StartTag a at (r1,c13,p12) has missing whitespace after quoted attribute value at position (r1,c39,p38)
ERROR: net.htmlparser.jericho - StartTag a at (r1,c13,p12) has missing whitespace after quoted attribute value at position (r1,c46,p45)
ERROR: net.htmlparser.jericho - StartTag a at (r1,c13,p12) contains attribute name with invalid character at position (r1,c50,p49)
ERROR: net.htmlparser.jericho - StartTag a at (r1,c13,p12) contains attribute name with invalid character at position (r1,c51,p50)
ERROR: net.htmlparser.jericho - StartTag a at (r1,c13,p12) has missing whitespace after quoted attribute value at position (r1,c53,p52)
ERROR: net.htmlparser.jericho - StartTag a at (r1,c13,p12) rejected because it contains too many errors
ERROR: net.htmlparser.jericho - Encountered possible StartTag at (r1,c13,p12) whose content does not match a registered StartTagType
In my program, I just do:
final Source source = new Source(html);
source.fullSequentialParse();
The errors appear just after calling source.fullSequentialParse().
Is it a bug? Or is there a specific method to call for this case?
Thanks
Hi Souad,
This is not a bug, just the parser properly reporting the syntax errors in the HTML. Quotes inside attribute values must be converted to character references such as &quot;.
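For anyone generating this markup themselves, here is a minimal sketch of escaping a value before it is written into a double-quoted attribute. The class and method names (AttributeEscaper, escapeAttributeValue) are hypothetical and only for illustration; they are not part of Jericho, and the code uses nothing beyond the standard library.

```java
public class AttributeEscaper {

    // Hypothetical helper: replaces the characters that are unsafe inside a
    // double-quoted HTML attribute value with character references.
    public static String escapeAttributeValue(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;   // must be escaped first-class so existing text is not read as a reference
                case '"': sb.append("&quot;"); break;  // the quote that would otherwise end the attribute value
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String title = "some text \"some quoted text\"";
        // Build the anchor from the question with the title value escaped.
        String html = "<html><a title=\"" + escapeAttributeValue(title)
                + "\" target=\"_blank\"> some link </a></html>";
        System.out.println(html);
    }
}
```

With the title escaped this way, the embedded quotes no longer terminate the attribute value, so the target attribute that follows is delimited unambiguously.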
By default the parser is configured to give up trying to parse a tag if there are more than two minor syntactical errors. You can make it more tolerant by setting the static configuration property Attributes.setDefaultMaxErrorCount(5) or some higher number.
Increasing the tolerance in this way will ensure the tag is still recognised by the parser, although it will only be able to guess at what the attributes were meant to be. Browsers will also be guessing, and neither will properly interpret embedded quotes in attribute values.
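A rough sketch of how that could look in the program from the question, assuming the Jericho jar is on the classpath. The class name TolerantParse and the threshold of 10 are arbitrary choices for illustration (any value of 5 or above follows the suggestion), and the loop only prints whatever attributes the parser managed to guess.

```java
import net.htmlparser.jericho.Attributes;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.StartTag;

public class TolerantParse {
    public static void main(String[] args) {
        // The malformed HTML from the question: unescaped quotes inside title.
        String html = "<html><a title=\"some text \"some quoted text\"\" target=\"_blank\"> some link </a></html>";

        // Static setting, so it applies to every Source parsed afterwards.
        // Set it before parsing. 10 is an illustrative value, not a recommendation.
        Attributes.setDefaultMaxErrorCount(10);

        Source source = new Source(html);
        source.fullSequentialParse();

        // Show which start tags were recognised and how their attributes were read.
        // The attributes of the <a> tag are a best guess, since the source is invalid.
        for (StartTag tag : source.getAllStartTags()) {
            System.out.println(tag.getName() + " -> " + tag.getAttributes());
        }
    }
}
```

Because the setting is static, it is probably best placed once at application start-up rather than next to each parse call.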
Cheers
Martin
Thanks Martin, it worked!
By the way, I just found a discussion in the forum about the same problem: http://sourceforge.net/p/jerichohtml/discussion/350024/thread/3388b6e3/
I didn't see it before (and neither did Google!), so sorry for opening this false bug report. I will search the forum more carefully next time :)
cheers