Jericho HTML Parser / Bugs / #94 Incorrectly parsing attribute values containing tags containing quotes

Martin Jericho - 2022-02-20

status: unread --> closed-rejected

assigned_to: Martin Jericho
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Martin Jericho - 2022-02-20

Hi Chris,

The situation you've encountered is an-unavoidable consequence of using xml-style server tags inside HTML tags, and the reason I consider it to be bad practice. It is no possible for a parser to know which tags are server tags unless you tell it, so the solution you proposed does not work in all cases.

If you can't avoid the use of xml-style server tags in your HTML, the way you tell the parser which tags are server tags is to explicitly search for them first.

For each server tag you find, call the Segment.ignoreWhenParsing() method on it. The parser will then be able to parse the surrounding HTML without problem. See the following documentation for more details:
http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Segment.html#ignoreWhenParsing()

I hope this helps you out.

Cheers,
Martin

Last edit: Martin Jericho 2022-02-20

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Chris Santos-Lang - 2022-02-21

Thanks for the quick response! :)

I can avoid the practice of using xml-style server tags inside HTML tags in my own code, but I am running into this difficulty when processing legacy code.

Can I ask some follow-up questions?
1. I would be happy to ignore all tags starting with <nested: or <logic: and I'd prefer to use Jericho to find them, rather than write my own parser to find them, but I am confused by the instruction that ignoreWhenParsing() must be called before fullSequentialParse() occurs. Does it mean I can't use this kind of thing?

Tag tag = getNextStartTag(0, "nested:"); while (tag != null){ tag.getElement().ignoreWhenParsing(); pos = tag.getBegin + 1 Tag tag = getNextStartTag(pos, "nested:"); }

Are you recommending that I create two identical copies of the Source (one parsed and the other not) , and feed the begin and end positions from the parsed one to modify the unparsed before parsing it (or are there other examples of what you have in mind)?
2. My tests seem to indicate that my solution works for the cases I have encountered so far (which is probably far less than what you have encountered, but nonetheless a potentially relevant range of cases). Just curious: What's the harm in offering an option to parse more cases correctly (even if you can't handle every case)?

Thanks!
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Martin Jericho - 2022-02-22

No you don't need to create multiple copies of the Source object. The important thing is just that you find the server tags and call source.ignoreWhenParsing() on them before searching for the HTML tags that contain them.

This implies that you need to call it before a full sequential parse is performed, but that may never even be called if you don't call any of the methods getAllTags(), getAllStartTags(), getAllElements(), getChildElements(), iterator() or Segment.getNodeIterator() method on the Source object.

In your case, all you need to do is the following:

source.ignoreWhenParsing(source.getAllStartTags("nested:")); source.ignoreWhenParsing(source.getAllStartTags("logic:")); System.out.println(source.getFirstElement().getAttributeValue("onkeyup"));

This should make the parser recognise the span tag as you desire, and the output (the value of the onkeyup attribute) should be:

if(event.keyCode==32) run(<nested:write property="TEST"/>);

To answer your second question, the harm in implementing a solution where the parser automatically ignores valid tags within an attribute value is that it results in incorrect results in many cases, including where the HTML is not meant to contain any server tags at all. It would be deliberately introducing bugs into the code.

Take as an example the following valid HTML element:

<p data-b="<c:d e=" data-f=">"></p>

It has two attributes. The value of the data-b attribute is <c:d e=, and the value of the data-f attribute is>. If your suggested change were implemented, the parser would incorrectly report a single attribute data-b with value <c:d e=" data-f=">.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Chris Santos-Lang - 2022-02-22

Thanks. I get it now :)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Incorrectly parsing attribute values containing tags containing quotes

Group

Searches

Help

#94 Incorrectly parsing attribute values containing tags containing quotes

Discussion