Menu

#94 Incorrectly parsing attribute values containing tags containing quotes

General
closed-rejected
None
5
2022-02-22
2022-02-20
No

Thank you for building a tool that can handle server-side tags! Here's code demonstrating an issue I found with server-side tags:

Source source = new Source("<SPAN id=\"test\" onkeyup=\"if(event.keyCode==32) run(<nested:write property=\"TEST\"/>);\" onmouseover=\"this.style.color='red';\"/></SPAN>");
List<Element> list = source.getAllElements();

This single tag should be parsed as having three attributes: id, onkeyup, and onmouseover, but jericho-html-3.4, is incorrectly parsing the second as "onkeyup=if(event.keyCode==32) run(<nested:write property=", and is parsing the thirds as TEST"/ (and is incorrectly parsing the StartTag as <SPAN id="test" onkeyup="if(event.keyCode==32) run(<nested:write property="TEST"/>

I think the problem is in Attributes.construct() where, at i=74, the IN_VALUE case interprets the quotation marks as ending the attribute, but those quotation marks are part of the inner (server-side) tag, not part of the outer tag.

My suggested solution is to replace this in Attributes.construct():

} else if (ch=='<' && quote==' ') {
               if (source.logger.isErrorEnabled()) log(source,logType,tagName,logBegin,"rejected because of '<' character in unquoted attribute value",i);
               return null;
}

with this:

~~~
} else if (ch=='<'){
if (quote==' ') {
if (source.logger.isErrorEnabled()) log(source,logType,tagName,logBegin,"rejected because of '<' character in unquoted attribute value",i);
return null;
} else {
final Tag tag=TagType.getTagAt(source,i,false,false);
if (tag!=null) {
i = tag.getEnd();
}
}
}
~~~
In other words, if a valid tag is found inside an attribute value, do not split it--skip to the end of the found tag.

Discussion

  • Martin Jericho

    Martin Jericho - 2022-02-20
    • status: unread --> closed-rejected
    • assigned_to: Martin Jericho
     
  • Martin Jericho

    Martin Jericho - 2022-02-20

    Hi Chris,

    The situation you've encountered is an-unavoidable consequence of using xml-style server tags inside HTML tags, and the reason I consider it to be bad practice. It is no possible for a parser to know which tags are server tags unless you tell it, so the solution you proposed does not work in all cases.

    If you can't avoid the use of xml-style server tags in your HTML, the way you tell the parser which tags are server tags is to explicitly search for them first.

    For each server tag you find, call the Segment.ignoreWhenParsing() method on it. The parser will then be able to parse the surrounding HTML without problem. See the following documentation for more details:
    http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Segment.html#ignoreWhenParsing()

    I hope this helps you out.

    Cheers,
    Martin

     

    Last edit: Martin Jericho 2022-02-20
  • Chris Santos-Lang

    Thanks for the quick response! :)

    I can avoid the practice of using xml-style server tags inside HTML tags in my own code, but I am running into this difficulty when processing legacy code.

    Can I ask some follow-up questions?
    1. I would be happy to ignore all tags starting with <nested: or <logic: and I'd prefer to use Jericho to find them, rather than write my own parser to find them, but I am confused by the instruction that ignoreWhenParsing() must be called before fullSequentialParse() occurs. Does it mean I can't use this kind of thing?

    Tag tag = getNextStartTag(0, "nested:");
    while (tag != null){
        tag.getElement().ignoreWhenParsing();
        pos = tag.getBegin + 1
        Tag tag = getNextStartTag(pos, "nested:");
    }
    

    Are you recommending that I create two identical copies of the Source (one parsed and the other not) , and feed the begin and end positions from the parsed one to modify the unparsed before parsing it (or are there other examples of what you have in mind)?
    2. My tests seem to indicate that my solution works for the cases I have encountered so far (which is probably far less than what you have encountered, but nonetheless a potentially relevant range of cases). Just curious: What's the harm in offering an option to parse more cases correctly (even if you can't handle every case)?

    Thanks!

     
  • Martin Jericho

    Martin Jericho - 2022-02-22

    No you don't need to create multiple copies of the Source object. The important thing is just that you find the server tags and call source.ignoreWhenParsing() on them before searching for the HTML tags that contain them.

    This implies that you need to call it before a full sequential parse is performed, but that may never even be called if you don't call any of the methods getAllTags(), getAllStartTags(), getAllElements(), getChildElements(), iterator() or Segment.getNodeIterator() method on the Source object.

    In your case, all you need to do is the following:

    source.ignoreWhenParsing(source.getAllStartTags("nested:"));
    source.ignoreWhenParsing(source.getAllStartTags("logic:"));
    System.out.println(source.getFirstElement().getAttributeValue("onkeyup"));
    

    This should make the parser recognise the span tag as you desire, and the output (the value of the onkeyup attribute) should be:

    if(event.keyCode==32) run(<nested:write property="TEST"/>);

    To answer your second question, the harm in implementing a solution where the parser automatically ignores valid tags within an attribute value is that it results in incorrect results in many cases, including where the HTML is not meant to contain any server tags at all. It would be deliberately introducing bugs into the code.

    Take as an example the following valid HTML element:

    <p data-b="<c:d e=" data-f=">"></p>

    It has two attributes. The value of the data-b attribute is <c:d e=, and the value of the data-f attribute is>. If your suggested change were implemented, the parser would incorrectly report a single attribute data-b with value <c:d e=" data-f=">.

     
  • Chris Santos-Lang

    Thanks. I get it now :)

     

Log in to post a comment.