Thank you for building a tool that can handle server-side tags! Here's code demonstrating an issue I found with server-side tags:
Source source = new Source("<SPAN id=\"test\" onkeyup=\"if(event.keyCode==32) run(<nested:write property=\"TEST\"/>);\" onmouseover=\"this.style.color='red';\"/></SPAN>");
List<Element> list = source.getAllElements();
This single tag should be parsed as having three attributes: id, onkeyup, and onmouseover, but jericho-html-3.4, is incorrectly parsing the second as "onkeyup=if(event.keyCode==32) run(<nested:write property=
", and is parsing the thirds as TEST"/
(and is incorrectly parsing the StartTag
as <SPAN id="test" onkeyup="if(event.keyCode==32) run(<nested:write property="TEST"/>
I think the problem is in Attributes.construct()
where, at i
=74, the IN_VALUE
case interprets the quotation marks as ending the attribute, but those quotation marks are part of the inner (server-side) tag, not part of the outer tag.
My suggested solution is to replace this in Attributes.construct()
:
} else if (ch=='<' && quote==' ') {
if (source.logger.isErrorEnabled()) log(source,logType,tagName,logBegin,"rejected because of '<' character in unquoted attribute value",i);
return null;
}
with this:
~~~
} else if (ch=='<'){
if (quote==' ') {
if (source.logger.isErrorEnabled()) log(source,logType,tagName,logBegin,"rejected because of '<' character in unquoted attribute value",i);
return null;
} else {
final Tag tag=TagType.getTagAt(source,i,false,false);
if (tag!=null) {
i = tag.getEnd();
}
}
}
~~~
In other words, if a valid tag is found inside an attribute value, do not split it--skip to the end of the found tag.
Hi Chris,
The situation you've encountered is an-unavoidable consequence of using xml-style server tags inside HTML tags, and the reason I consider it to be bad practice. It is no possible for a parser to know which tags are server tags unless you tell it, so the solution you proposed does not work in all cases.
If you can't avoid the use of xml-style server tags in your HTML, the way you tell the parser which tags are server tags is to explicitly search for them first.
For each server tag you find, call the Segment.ignoreWhenParsing() method on it. The parser will then be able to parse the surrounding HTML without problem. See the following documentation for more details:
http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Segment.html#ignoreWhenParsing()
I hope this helps you out.
Cheers,
Martin
Last edit: Martin Jericho 2022-02-20
Thanks for the quick response! :)
I can avoid the practice of using xml-style server tags inside HTML tags in my own code, but I am running into this difficulty when processing legacy code.
Can I ask some follow-up questions?
1. I would be happy to ignore all tags starting with
<nested:
or<logic:
and I'd prefer to use Jericho to find them, rather than write my own parser to find them, but I am confused by the instruction thatignoreWhenParsing()
must be called beforefullSequentialParse()
occurs. Does it mean I can't use this kind of thing?Are you recommending that I create two identical copies of the
Source
(one parsed and the other not) , and feed the begin and end positions from the parsed one to modify the unparsed before parsing it (or are there other examples of what you have in mind)?2. My tests seem to indicate that my solution works for the cases I have encountered so far (which is probably far less than what you have encountered, but nonetheless a potentially relevant range of cases). Just curious: What's the harm in offering an option to parse more cases correctly (even if you can't handle every case)?
Thanks!
No you don't need to create multiple copies of the Source object. The important thing is just that you find the server tags and call source.ignoreWhenParsing() on them before searching for the HTML tags that contain them.
This implies that you need to call it before a full sequential parse is performed, but that may never even be called if you don't call any of the methods getAllTags(), getAllStartTags(), getAllElements(), getChildElements(), iterator() or Segment.getNodeIterator() method on the Source object.
In your case, all you need to do is the following:
This should make the parser recognise the span tag as you desire, and the output (the value of the onkeyup attribute) should be:
if(event.keyCode==32) run(<nested:write property="TEST"/>);
To answer your second question, the harm in implementing a solution where the parser automatically ignores valid tags within an attribute value is that it results in incorrect results in many cases, including where the HTML is not meant to contain any server tags at all. It would be deliberately introducing bugs into the code.
Take as an example the following valid HTML element:
<p data-b="<c:d e=" data-f=">"></p>
It has two attributes. The value of the data-b attribute is
<c:d e=
, and the value of the data-f attribute is>
. If your suggested change were implemented, the parser would incorrectly report a single attribute data-b with value<c:d e=" data-f=">
.Thanks. I get it now :)