I had a problem with an HTML document that contains link elements inside the body, such as:
<body>
<link rel="shortcut icon" href="/favicon.ico">
</body>
Notice that the tag is closed by '>' and not '/>'.
This breaks the parsing of the HTML completely: the link tag is never closed, so the content that follows it gets encapsulated inside it.
I fixed this by updating the HtmlTokenizer like this:
In tagStart:
Keep the TagInfo of the opened tag in the outer scope:
    TagInfo tagInfo = null;
    if (tagName != null) {
        ITagInfoProvider tagInfoProvider = cleaner.getTagInfoProvider();
        tagInfo = tagInfoProvider.getTagInfo(tagName);
        if ( (tagInfo == null && !props.isOmitUnknownTags() && props.isTreatUnknownTagsAsContent() && !isReservedTag(tagName) && !props.isNamespacesAware()) ||
             (tagInfo != null && tagInfo.isDeprecated() && !props.isOmitDeprecatedTags() && props.isTreatDeprecatedTagsAsContent()) ) {
            content();
            return;
        }
    }
Do not open a special context for empty tags; directly inject an EndTagToken instead:
    if ( isChar('>') ) {
        go();
        if (tagInfo == null || !tagInfo.isEmptyTag()) {
            if (props.isUseCdataFor(tagName)) {
                _isSpecialContext = true;
                _isSpecialContextName = tagName;
            }
        } else {
            addToken(new EndTagToken(tagName));
        }
    } else if ( startsWith("/>") ) {
        go(2);
        //
        // If the tag is self-closing, add an end tag token here to avoid
        // encapsulating the following content. See issue #93.
        //
        addToken(new EndTagToken(tagName));
    }
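
To illustrate the idea outside of HtmlCleaner, here is a minimal, self-contained sketch of a toy tokenizer. It is not the library's code: VoidTagSketch, tokenize, EMPTY_TAGS and the closeEmptyTags flag are hypothetical names made up for this example, and EMPTY_TAGS only stands in for what TagInfo.isEmptyTag() reports. It assumes very simple input (no comments, no '>' inside attribute values).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class VoidTagSketch {
    // Hypothetical stand-in for TagInfo.isEmptyTag(): tags that never have content.
    static final Set<String> EMPTY_TAGS = Set.of("link", "br", "img", "meta", "hr", "input");

    /**
     * Tokenizes very simple HTML into START(..), END(..) and TEXT(..) tokens.
     * When closeEmptyTags is true, an END token is injected right after an
     * empty tag that was closed by '>' instead of '/>'.
     */
    static List<String> tokenize(String html, boolean closeEmptyTags) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) != '<') {
                // Plain text up to the next tag.
                int next = html.indexOf('<', i);
                if (next < 0) next = html.length();
                String text = html.substring(i, next).trim();
                if (!text.isEmpty()) tokens.add("TEXT(" + text + ")");
                i = next;
                continue;
            }
            int end = html.indexOf('>', i);
            if (end < 0) break; // unterminated tag: stop tokenizing
            String inside = html.substring(i + 1, end).trim();
            i = end + 1;
            if (inside.startsWith("/")) {
                tokens.add("END(" + inside.substring(1).trim() + ")");
                continue;
            }
            boolean selfClosing = inside.endsWith("/");
            if (selfClosing) inside = inside.substring(0, inside.length() - 1).trim();
            String name = inside.split("\\s+")[0].toLowerCase();
            tokens.add("START(" + name + ")");
            // Self-closing tags always get an end token; empty tags closed
            // by '>' get one only when the fix is enabled.
            if (selfClosing || (closeEmptyTags && EMPTY_TAGS.contains(name))) {
                tokens.add("END(" + name + ")");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String html = "<body><link rel=\"shortcut icon\" href=\"/favicon.ico\"><p>hi</p></body>";
        // Without the fix, link is never closed:
        // [START(body), START(link), START(p), TEXT(hi), END(p), END(body)]
        System.out.println(tokenize(html, false));
        // With the fix, END(link) directly follows START(link):
        // [START(body), START(link), END(link), START(p), TEXT(hi), END(p), END(body)]
        System.out.println(tokenize(html, true));
    }
}
```

In the first token stream nothing closes the link before the paragraph starts, which is how the following content ends up inside it. In the second the link is closed immediately, which is what the patched '>' branch does for tags whose TagInfo reports isEmptyTag().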
Thanks Anthony,
I'll apply your patch and rerun the test cases.