Problem with quotation marks and attributes

Brought to you by: mjericho

Forum: Open Discussion

Creator: Mark H. Butler

Created: 2013-02-01

Updated: 2013-02-01

Mark H. Butler - 2013-02-01

Hi Martin!

First, thanks for Jericho; it is a very useful tool. Thank you very much.

Jericho cannot parse broken HTML like this:

<table style="WIDTH: 733px; height=; color: #666;"380" cellspacing="2" cellpadding="0" width="1019" align="center">

because of the extra quote. I know this is horribly broken HTML, but this type of error is surprisingly common because one tool (I am guessing Microsoft Word) uses &quot; to put quotes inside attributes, e.g.

<span style="mso-ascii-font-family: &quot;Times New Roman&quot;; mso-bidi-font-family: &quot;Times New Roman&quot;; mso-ansi-language: EN-US; mso-fareast-language: ZH-TW; mso-bidi-language: AR-SA" class="style1">

If the file is then passed through another tool that blindly unescapes &quot;, Jericho can't parse it and ignores large sections of the HTML document.
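To make the failure mode concrete, here is a minimal sketch in plain Java. It does not use Jericho at all: `blindlyUnescape` is a hypothetical stand-in for the unknown tool, and `firstQuotedValue` is a deliberately naive attribute scanner written only for this illustration, not how Jericho works internally. It shows that once &quot; is turned into a literal quote, the style attribute appears to end right after the first property name.

```java
public class QuoteUnescapeDemo {
    // Naive scanner (illustration only): returns the text between name=" and
    // the next double quote, or null if the attribute is not found.
    static String firstQuotedValue(String tag, String name) {
        String key = name + "=\"";
        int start = tag.indexOf(key);
        if (start < 0) {
            return null;
        }
        start += key.length();
        int end = tag.indexOf('"', start);
        return end < 0 ? null : tag.substring(start, end);
    }

    // Hypothetical stand-in for a tool that unescapes &quot; everywhere,
    // including inside quoted attribute values.
    static String blindlyUnescape(String html) {
        return html.replace("&quot;", "\"");
    }

    public static void main(String[] args) {
        String escaped = "<span style=\"mso-ascii-font-family: &quot;Times New Roman&quot;; "
                + "mso-ansi-language: EN-US\" class=\"style1\">";
        String broken = blindlyUnescape(escaped);

        // Escaped input: the whole style value is read.
        System.out.println(firstQuotedValue(escaped, "style"));
        // Unescaped input: the value is cut off after "mso-ascii-font-family: ".
        System.out.println(firstQuotedValue(broken, "style"));
    }
}
```

Everything after the premature closing quote (Times New Roman"; ...) is then left over as text that does not look like valid attributes, which is presumably what the parser counts as attribute errors.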

Is there any way to avoid this? In the example above, a numeric character follows the quote, indicating an attribute problem, so could the parser ignore everything up to the first > and at least recover the table contents?

Thanks,

Mark

Last edit: Mark H. Butler 2013-02-01

Martin Jericho - 2013-02-01

Hi Mark,

What is this mystery tool that blindly unescapes character references in your source? Get rid of that and the problem goes away.

If for some reason you have no control over it, you could increase the maximum number of attribute errors allowed before the parser rejects a tag, using the static configuration method Attributes.setDefaultMaxErrorCount(int).

Cheers
Martin

Mark H. Butler - 2013-02-01

Hi Martin,

I looked at the code in Attributes.construct(), and the problem is that the default error threshold is set quite low. If I increase it, e.g.

Attributes.setDefaultMaxErrorCount(15);

then I can successfully parse the content.

Thanks very much for this great library!

Best wishes,

Mark
