Hi Martin!
First, thanks for Jericho. It is a very useful tool.
Jericho cannot parse broken HTML like this:
because of the extra quote. I know this is horribly broken HTML, but this type of error is surprisingly common because one tool (I am guessing Microsoft Word) uses
&quot;
to put quotes inside attributes, e.g.
<span style="mso-ascii-font-family: &quot;Times New Roman&quot;; mso-bidi-font-family: &quot;Times New Roman&quot;; mso-ansi-language: EN-US; mso-fareast-language: ZH-TW; mso-bidi-language: AR-SA" class="style1">
so if the file is passed through another tool that blindly unescapes
&quot;
then Jericho can't parse it and ignores large sections of the HTML document.
Is there any way to avoid this? In the example above a numeric character follows the quote, indicating an attribute problem, so the parser could ignore everything up to the first >, and then at least it would recover the table contents?
Thanks,
Mark
Last edit: Mark H. Butler 2013-02-01
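To make the failure mode concrete, here is a minimal sketch in plain Java (no Jericho involved). The class name, both helper methods, and the shortened sample tag are hypothetical, and the naive replace only stands in for the unnamed tool that unescapes the document. It shows how a quote-delimited scan sees the style value end early once &quot; has been turned into a literal quote.

```java
public class QuotUnescapeDemo {

    // Hypothetical stand-in for a tool that unescapes &quot; everywhere,
    // including inside attribute values.
    static String blindlyUnescapeQuot(String html) {
        return html.replace("&quot;", "\"");
    }

    // Returns the text between the first two double quotes in the tag,
    // which is how a simple quote-delimited scan would read the first
    // attribute value. Returns an empty string if there is no such pair.
    static String firstQuotedValue(String tag) {
        int open = tag.indexOf('"');
        if (open < 0) {
            return "";
        }
        int close = tag.indexOf('"', open + 1);
        if (close < 0) {
            return "";
        }
        return tag.substring(open + 1, close);
    }

    public static void main(String[] args) {
        // Shortened version of the tag quoted in the post.
        String original = "<span style=\"mso-ascii-font-family: &quot;Times New Roman&quot;;"
                + " mso-ansi-language: EN-US\" class=\"style1\">";
        String damaged = blindlyUnescapeQuot(original);

        // Before unescaping, the whole style declaration is one value.
        System.out.println("[" + firstQuotedValue(original) + "]");
        // After unescaping, the value stops at the quote before "Times".
        System.out.println("[" + firstQuotedValue(damaged) + "]");
    }
}
```

Everything after the truncated value (Times New Roman"; and so on) is then left over as malformed attribute text, which is the kind of input that produces attribute errors in a parser.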
Hi Mark,
What is this mystery tool that blindly unescapes character references in your source? Get rid of that and the problem goes away.
If for some reason you have no control over it, you could increase the maximum number of attribute errors allowed before the parser rejects a tag, using the static configuration method Attributes.setDefaultMaxErrorCount(int).
Cheers
Martin
Hi Martin,
I looked at the code in Attributes.construct(), and the problem is that the default error threshold is set quite low. If I increase it, e.g.
Attributes.setDefaultMaxErrorCount(15);
then I can successfully parse the content.
Thanks very much for this great library!
Best wishes,
Mark
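For anyone finding this thread later, the fix discussed above can be sketched roughly as follows. This assumes the Jericho jar is on the classpath. The class name, the helper method, and the sample HTML are hypothetical, and the value 15 is simply the one that worked for the document in this thread, not a recommended setting.

```java
import java.util.ArrayList;
import java.util.List;

import net.htmlparser.jericho.Attributes;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

public class LenientAttributeParsing {

    // Raises the attribute error threshold, then collects the text of
    // every <td> element found in the given HTML.
    static List<String> parseCellTexts(String html, int maxErrors) {
        // Static, library-wide setting: it must be applied before the
        // source is searched for tags.
        Attributes.setDefaultMaxErrorCount(maxErrors);

        Source source = new Source(html);
        List<String> texts = new ArrayList<>();
        for (Element cell : source.getAllElements(HTMLElementName.TD)) {
            texts.add(cell.getContent().getTextExtractor().toString());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical input with unescaped quotes inside the style attribute.
        String html = "<table><tr><td><span style=\"mso-ascii-font-family: "
                + "\"Times New Roman\"; mso-ansi-language: EN-US\" class=\"style1\">"
                + "cell text</span></td></tr></table>";

        for (String text : parseCellTexts(html, 15)) {
            System.out.println(text);
        }
    }
}
```

Because the setting is static, it affects all parsing done afterwards in the same JVM, so it is probably best set once at startup rather than per document.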